2025-05-08-12-04

The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete

Abstract

arXiv:2505.03961v1 Announce Type: new Abstract: According to Yuval Noah Harari, large-scale human cooperation is driven by shared narratives that encode common beliefs and values. This study explores whether such narratives can similarly nudge LLM agents toward collaboration. We use a finitely repeated public goods game in which LLM agents choose either cooperative or egoistic spending strategies. We prime agents with stories highlighting teamwork to different degrees and test how this influences negotiation outcomes. Our experiments explore four questions: (1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What happens when the number of agents grows? (4) Are agents resilient against self-serving negotiators? We find that story-based priming significantly affects negotiation strategies and success rates. Common stories improve collaboration, benefiting each agent. By contrast, priming agents with different stories reverses this effect, and those agents primed toward self-interest prevail. We hypothesize that these results carry implications for multi-agent system design and AI alignment.

摘要

尤瓦尔·赫拉利提出，大规模人类合作是由编码共同信念与价值观的共享叙事驱动的。本研究探讨此类叙事是否能类似地促进大语言模型智能体之间的协作。我们采用有限重复公共物品博弈框架，其中大语言模型智能体可选择合作性或利己性支出策略。通过向智能体注入不同程度强调团队合作的故事，我们测试了这种干预对谈判结果的影响。实验围绕四个核心问题展开：（1）叙事如何影响谈判行为？（2）智能体共享相同故事与不同故事时有何差异？（3）智能体数量增加时会产生什么变化？（4）智能体能否抵御自利型谈判者的影响？研究发现：基于故事的干预显著影响谈判策略与成功率。共同叙事能提升协作水平，使所有智能体获益；而注入不同故事则会产生相反效果，此时被植入利己倾向的智能体将占据优势。我们推测这些发现对多智能体系统设计与人工智能对齐研究具有启示意义。
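
The abstract above does not state the payoff rule used in the experiments. As a rough illustration only, the following sketch assumes the standard linear public goods game: each agent keeps whatever it does not contribute, and the pooled contributions are multiplied and split equally. The names `endowment` and `multiplier` and their values are hypothetical, not taken from the paper.

```python
def public_goods_payoffs(contributions, endowment=10.0, multiplier=1.5):
    """Payoffs for one round of a linear public goods game (assumed form)."""
    n = len(contributions)
    # The common pot is scaled by the multiplier and shared equally.
    share = multiplier * sum(contributions) / n
    # Each agent keeps its unspent endowment plus its share of the pot.
    return [endowment - c + share for c in contributions]


# Four agents, everyone contributes fully.
all_cooperate = public_goods_payoffs([10, 10, 10, 10])
# Same group, but the first agent contributes nothing.
one_defects = public_goods_payoffs([0, 10, 10, 10])

print(all_cooperate)
print(one_defects)
```

Under these assumed numbers the free rider earns more than any cooperator in the same round, while everyone earns less than the defector's partners would under full cooperation. That is the tension the narrative priming is meant to shift.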

Frog Soup: Zero-Shot, In-Context, and Sample-Efficient Frogger Agents

Abstract

arXiv:2505.03947v1 Announce Type: new Abstract: One of the primary aspirations in reinforcement learning research is developing general-purpose agents capable of rapidly adapting to and mastering novel tasks. While RL gaming agents have mastered many Atari games, they remain slow and costly to train for each game. In this work, we demonstrate that the latest reasoning LLMs with out-of-domain RL post-training can play a challenging Atari game called Frogger in a zero-shot setting. We then investigate the effect of in-context learning and the amount of reasoning effort on LLM performance. Lastly, we demonstrate a way to bootstrap traditional RL methods with LLM demonstrations, which significantly improves their performance and sample efficiency. Our implementation is open sourced at https://github.com/AlienKevin/frogger.

摘要

强化学习研究的主要目标之一是开发能够快速适应并掌握新任务的通用智能体。尽管现有RL游戏智能体已能精通多种Atari游戏，但针对每个新游戏的训练过程仍显缓慢且成本高昂。本研究表明，经过跨领域强化学习后训练的最新推理型大语言模型（LLM）可在零样本设置下玩转名为《青蛙过河》的高难度Atari游戏。我们进一步探究了上下文学习效果及推理努力程度对LLM表现的影响。最后，我们提出一种利用LLM演示数据引导传统RL方法的技术，该方法显著提升了RL算法的性能与样本效率。项目代码已开源：https://github.com/AlienKevin/frogger。

MARCO: A Multi-Agent System for Optimizing HPC Code Generation Using Large Language Models

Abstract

arXiv:2505.03906v1 Announce Type: new Abstract: Large language models (LLMs) have transformed software development through code generation capabilities, yet their effectiveness for high-performance computing (HPC) remains limited. HPC code requires specialized optimizations for parallelism, memory efficiency, and architecture-specific considerations that general-purpose LLMs often overlook. We present MARCO (Multi-Agent Reactive Code Optimizer), a novel framework that enhances LLM-generated code for HPC through a specialized multi-agent architecture. MARCO employs separate agents for code generation and performance evaluation, connected by a feedback loop that progressively refines optimizations. A key innovation is MARCO's web-search component that retrieves real-time optimization techniques from recent conference proceedings and research publications, bridging the knowledge gap in pre-trained LLMs. Our extensive evaluation on the LeetCode 75 problem set demonstrates that MARCO achieves a 14.6% average runtime reduction compared to Claude 3.5 Sonnet alone, while the integration of the web-search component yields a 30.9% performance improvement over the base MARCO system. These results highlight the potential of multi-agent systems to address the specialized requirements of high-performance code generation, offering a cost-effective alternative to domain-specific model fine-tuning.

摘要

大型语言模型（LLM）通过代码生成能力改变了软件开发方式，但其在高性能计算（HPC）领域的应用仍存在局限。HPC代码需要针对并行性、内存效率和特定架构优化的专门处理，而通用LLM往往忽视这些要素。本文提出MARCO（多智能体反应式代码优化器）——一种通过专业化多智能体架构增强LLM生成HPC代码的新型框架。MARCO采用代码生成与性能评估分离的智能体设计，通过反馈循环实现渐进式优化。其核心创新在于网络搜索组件，该组件能从最新会议论文集和研究文献中获取实时优化技术，弥补预训练LLM的知识缺口。基于LeetCode 75题集的全面评估表明：相较于单独使用Claude 3.5 Sonnet，MARCO实现了14.6%的平均运行时降低；而网络搜索组件的集成更使系统性能较基础版MARCO提升30.9%。这些结果证明多智能体系统在满足高性能代码生成特殊需求方面的潜力，为领域专用模型微调提供了经济高效的替代方案。

Abstract

arXiv:2505.04021v1 Announce Type: new Abstract: Serving large language models (LLMs) is expensive, especially for providers hosting many models, making cost reduction essential. The unique workload patterns of serving multiple LLMs (i.e., multi-LLM serving) create new opportunities and challenges for this task. The long-tail popularity of models and their long idle periods present opportunities to improve utilization through GPU sharing. However, existing GPU sharing systems lack the ability to adjust their resource allocation and sharing policies at runtime, making them ineffective at meeting latency service-level objectives (SLOs) under rapidly fluctuating workloads. This paper presents Prism, a multi-LLM serving system that unleashes the full potential of GPU sharing to achieve both cost efficiency and SLO attainment. At its core, Prism tackles a key limitation of existing systems: the lack of cross-model memory coordination, which is essential for flexibly sharing GPU memory across models under dynamic workloads. Prism achieves this with two key designs. First, it supports on-demand memory allocation by dynamically mapping physical to virtual memory pages, allowing flexible memory redistribution among models that space- and time-share a GPU. Second, it improves memory efficiency through a two-level scheduling policy that dynamically adjusts sharing strategies based on models' runtime demands. Evaluations on real-world traces show that Prism achieves more than 2x cost savings and 3.3x SLO attainment compared to state-of-the-art systems.

摘要

大型语言模型(LLM)的服务成本高昂，尤其对托管多模型的提供商而言，降低成本至关重要。多LLM服务的独特工作负载模式为此任务带来了新机遇与挑战。模型的长尾流行特性及其长时间空闲状态为通过GPU共享提升利用率创造了条件。然而，现有GPU共享系统缺乏运行时调整资源分配与共享策略的能力，导致其在快速波动的工作负载下难以满足延迟服务等级目标(SLO)。本文提出Prism系统，通过充分释放GPU共享潜力实现成本效益与SLO达标双重目标。其核心解决了现有系统的关键缺陷——缺乏跨模型内存协调机制，该机制对动态负载下模型间灵活共享GPU内存至关重要。Prism通过两项关键设计实现这一目标：首先支持按需内存分配，通过动态映射物理与虚拟内存页，实现时空复用GPU的模型间灵活内存再分配；其次采用二级调度策略提升内存效率，根据模型运行时需求动态调整共享策略。真实场景测试表明，相较最先进系统，Prism可实现超过2倍的成本节约和3.3倍的SLO达标率提升。

LogiDebrief: A Signal-Temporal Logic based Automated Debriefing Approach with Large Language Models Integration

Abstract

arXiv:2505.03985v1 Announce Type: new Abstract: Emergency response services are critical to public safety, with 9-1-1 call-takers playing a key role in ensuring timely and effective emergency operations. To ensure call-taking performance consistency, quality assurance is implemented to evaluate and refine call-takers' skillsets. However, traditional human-led evaluations struggle with high call volumes, leading to low coverage and delayed assessments. We introduce LogiDebrief, an AI-driven framework that automates traditional 9-1-1 call debriefing by integrating Signal-Temporal Logic (STL) with Large Language Models (LLMs) for fully-covered rigorous performance evaluation. LogiDebrief formalizes call-taking requirements as logical specifications, enabling systematic assessment of 9-1-1 calls against procedural guidelines. It employs a three-step verification process: (1) contextual understanding to identify responder types, incident classifications, and critical conditions; (2) STL-based runtime checking with LLM integration to ensure compliance; and (3) automated aggregation of results into quality assurance reports. Beyond its technical contributions, LogiDebrief has demonstrated real-world impact. Successfully deployed at Metro Nashville Department of Emergency Communications, it has assisted in debriefing 1,701 real-world calls, saving 311.85 hours of active engagement. Empirical evaluation with real-world data confirms its accuracy, while a case study and extensive user study highlight its effectiveness in enhancing call-taking performance.

摘要

紧急响应服务对公共安全至关重要，其中9-1-1接警员在确保应急行动及时有效方面发挥着关键作用。为保证接警操作的一致性，需通过质量评估对接警员技能进行持续优化。然而传统人工评估方式难以应对高呼叫量，导致覆盖率不足和评估延迟。本文提出LogiDebrief框架，通过将信号时序逻辑（STL）与大语言模型（LLM）相结合，实现9-1-1呼叫事后分析的自动化处理，完成全覆盖的严格绩效评估。该框架将接警规范转化为逻辑规约，支持基于流程指南的系统化呼叫评估，其三步验证流程包括：(1)通过上下文理解识别响应者类型、事件分类及危急状态；(2)结合LLM的STL运行时检查确保规程合规；(3)自动生成质量评估报告。除技术贡献外，该框架已在纳什维尔市应急通信部门成功部署，累计完成1,701次真实呼叫分析，节省311.85小时人工处理时间。实证研究证实其评估准确性，案例分析与大规模用户研究则验证了其在提升接警绩效方面的有效性。
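
To make the idea of checking a call against a temporal requirement more concrete, here is a minimal sketch of two bounded temporal operators evaluated over a sampled Boolean signal. It is not LogiDebrief's implementation: the signal, the requirement ("the address is asked within 30 seconds"), and the function names are all hypothetical, and in the described system the truth values would presumably come from LLM-assisted analysis of the call.

```python
def eventually(signal, start, end):
    """True if the predicate holds at some sample with start <= t <= end."""
    return any(value for t, value in signal if start <= t <= end)


def always(signal, start, end):
    """True if the predicate holds at every sample with start <= t <= end."""
    return all(value for t, value in signal if start <= t <= end)


# Hypothetical signal: (seconds since call start, "address has been asked").
address_asked = [(0, False), (5, False), (12, True), (40, True)]

# Requirement sketch: the address is asked within the first 30 seconds.
on_time = eventually(address_asked, 0, 30)
# A stricter 10-second deadline is not met by this call.
within_ten = eventually(address_asked, 0, 10)
# Once asked, the fact stays true for the rest of the call.
stays_true = always(address_asked, 12, 40)

print(on_time, within_ten, stays_true)
```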

QStore: Quantization-Aware Compressed Model Storage

Abstract

arXiv:2505.04081v1 Announce Type: new Abstract: Modern applications commonly leverage large, multi-modal foundation models. These applications often feature complex workflows that demand the storage and usage of similar models in multiple precisions. A straightforward approach is to maintain a separate file for each model precision (e.g., INT8, BF16), which is indeed the approach taken by many model providers such as HuggingFace and Ollama. However, this approach incurs excessive storage costs since a higher precision model (e.g., BF16) is a strict superset of a lower precision model (e.g., INT8) in terms of information. Unfortunately, simply maintaining only the higher-precision model and requiring every user to dynamically convert the model precision is not desirable because every user of lower precision models must pay the cost for model download and precision conversion. In this paper, we present QStore, a unified, lossless compression format for simultaneously storing a model in two (high and low) precisions efficiently. Instead of storing low-precision and high-precision models separately, QStore stores the low-precision model and only the residual information needed to reconstruct the high-precision model. The size of the residual information is significantly smaller than the original high-precision model, thus achieving high savings in storage cost. Moreover, QStore does not compromise the speed of model loading. The low-precision models can be loaded quickly just like before. The high-precision models can also be reconstructed efficiently in memory by merging low-precision data and the residual with QStore's lightweight decoding logic. We evaluate QStore for compressing multiple precisions of popular foundation models, and show that QStore reduces overall storage footprint by up to 2.2x (45% of the original size) while enabling up to 1.7x and 1.8x faster model saving and loading versus existing approaches.

摘要

现代应用通常依赖于大型多模态基础模型。这些应用往往涉及复杂的工作流程，需要存储和使用多种精度的相似模型。常见的解决方案是为每种模型精度（如INT8、BF16）单独保存文件，这也是HuggingFace和Ollama等模型提供商采用的方法。然而，这种方法会导致存储成本过高，因为高精度模型（如BF16）在信息量上完全包含低精度模型（如INT8）。单纯只保存高精度模型并要求用户动态转换精度也不可行，因为所有低精度模型用户都必须承担模型下载和精度转换的开销。

本文提出QStore——一种高效存储高低双精度模型的无损统一压缩格式。QStore不再分别存储高低精度模型，而是保存低精度模型及重建高精度模型所需的残差信息。残差信息量远小于原始高精度模型，从而显著降低存储成本。此外，QStore不会影响模型加载速度：低精度模型可如常快速加载，高精度模型也能通过合并低精度数据与残差信息，配合QStore的轻量解码逻辑在内存中高效重建。我们对主流基础模型的多精度压缩进行测试，结果表明QStore最高可减少2.2倍存储空间（原大小的45%），同时模型保存和加载速度分别提升至现有方法的1.7倍和1.8倍。
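
The "low-precision data plus residual" idea can be illustrated with a deliberately simplified toy. The sketch below treats each high-precision value as a 16-bit integer, keeps the top 8 bits as the low-precision part, and stores the bottom 8 bits as the residual. This is an assumption made purely for illustration: real INT8 quantization involves scales and rounding, and the abstract does not describe QStore's actual encoding.

```python
def split_precisions(high_values):
    """Split 16-bit values into an 8-bit 'low precision' part and a residual."""
    low = [v >> 8 for v in high_values]         # top 8 bits, usable on their own
    residual = [v & 0xFF for v in high_values]  # what is needed to go back up
    return low, residual


def reconstruct_high(low, residual):
    """Merge low-precision data and residual back into the 16-bit values."""
    return [(l << 8) | r for l, r in zip(low, residual)]


# Hypothetical 16-bit weight bit patterns.
weights16 = [0x3F80, 0xBF00, 0x4049, 0x0001]

low8, residual8 = split_precisions(weights16)
restored = reconstruct_high(low8, residual8)

print(low8, residual8, restored == weights16)
```

In this toy the two stored parts together take exactly as much space as the high-precision values alone, so keeping a separate low-precision file would be pure overhead. The reported savings in the abstract come from QStore's own format and compression, which this sketch does not attempt to reproduce.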

Can Large Language Models Predict Parallel Code Performance?

Abstract

arXiv:2505.03988v1 Announce Type: new Abstract: Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware -- an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a roofline classification task: given the source code of a GPU kernel and the hardware specifications of a target GPU, can an LLM predict whether the GPU kernel is compute-bound or bandwidth-bound? For this study, we build a balanced dataset of 340 GPU kernels, obtained from HeCBench benchmark and written in CUDA and OpenMP, along with their ground-truth labels obtained via empirical GPU profiling. We evaluate LLMs across four scenarios: (1) with access to profiling data of the kernel source, (2) zero-shot with source code only, (3) few-shot with code and label pairs, and (4) fine-tuned on a small custom dataset. Our results show that state-of-the-art LLMs have a strong understanding of the Roofline model, achieving 100% classification accuracy when provided with explicit profiling data. We also find that reasoning-capable LLMs significantly outperform standard LLMs in zero- and few-shot settings, achieving up to 64% accuracy on GPU source codes, without profiling information. Lastly, we find that LLM fine-tuning will require much more data than what we currently have available. This work is among the first to use LLMs for source-level roofline performance prediction via classification, and illustrates their potential to guide optimization efforts when runtime profiling is infeasible. Our findings suggest that with better datasets and prompt strategies, LLMs could become practical tools for HPC performance analysis and performance portability.

摘要

准确评估并行GPU代码的性能通常需要在目标硬件上进行执行时间分析——由于高端GPU获取受限，这一步骤日益困难。本文探讨大型语言模型（LLMs）能否在不依赖硬件的情况下提供GPU性能预测的替代方案。我们将该问题构建为屋顶线分类任务：给定GPU内核的源代码和目标GPU的硬件规格，LLM能否预测该内核是计算受限还是带宽受限？

本研究构建了一个包含340个GPU内核的平衡数据集，这些内核来自HeCBench基准测试，采用CUDA和OpenMP编写，并通过实际GPU性能分析获得真实标签。我们在四种场景下评估LLMs：（1）提供内核源码的性能分析数据；（2）仅提供源代码的零样本学习；（3）提供代码-标签对的少样本学习；（4）在小规模定制数据集上微调。

结果表明，最先进的LLMs对屋顶线模型具有深刻理解，当提供明确性能分析数据时分类准确率达100%。我们还发现，具备推理能力的LLMs在零样本和少样本设置中显著优于标准LLMs，在不依赖性能分析信息的情况下，对GPU源代码的分类准确率最高可达64%。最后，我们发现LLM微调所需的数据量远超当前可用规模。

本研究首次利用LLMs通过分类实现源码级屋顶线性能预测，证明了其在无法进行运行时分析时指导优化工作的潜力。研究结果表明，通过更好的数据集和提示策略，LLMs有望成为高性能计算性能分析和性能可移植性的实用工具。
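
For readers unfamiliar with the classification target, the roofline rule itself is simple once profiling numbers are available: compare the kernel's arithmetic intensity (FLOPs per byte moved) with the machine balance (peak FLOP/s divided by peak bytes/s). The sketch below shows that rule; the hardware figures and kernel counts are hypothetical and are not drawn from the paper's dataset.

```python
def classify_roofline(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Label a kernel as compute-bound or bandwidth-bound under the roofline model."""
    intensity = flops / bytes_moved                # FLOPs per byte for the kernel
    balance = peak_flops_per_s / peak_bytes_per_s  # FLOPs per byte the GPU can sustain
    return "compute-bound" if intensity >= balance else "bandwidth-bound"


# Hypothetical GPU: 10 TFLOP/s peak compute, 1 TB/s peak memory bandwidth.
PEAK_FLOPS = 10e12
PEAK_BW = 1e12

n = 1024
# FP32 vector add: 1 FLOP per element, 12 bytes moved (two loads, one store).
vector_add = classify_roofline(n, 12 * n, PEAK_FLOPS, PEAK_BW)
# Idealized FP32 matrix multiply: 2*n^3 FLOPs, three n x n matrices moved once.
matmul = classify_roofline(2 * n**3, 3 * n * n * 4, PEAK_FLOPS, PEAK_BW)

print(vector_add, matmul)
```

The hard part studied in the paper is producing the same label from source code alone, without the measured FLOP and byte counts this function takes as input.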

TrajEvo: Designing Trajectory Prediction Heuristics via LLM-driven Evolution

Abstract

arXiv:2505.04480v1 Announce Type: new Abstract: Trajectory prediction is a crucial task in modeling human behavior, especially in fields such as social robotics and autonomous vehicle navigation. Traditional heuristics based on handcrafted rules often lack accuracy, while recently proposed deep learning approaches suffer from computational cost, lack of explainability, and generalization issues that limit their practical adoption. In this paper, we introduce TrajEvo, a framework that leverages Large Language Models (LLMs) to automatically design trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to generate and refine prediction heuristics from past trajectory data. We introduce a Cross-Generation Elite Sampling to promote population diversity and a Statistics Feedback Loop allowing the LLM to analyze alternative predictions. Our evaluations show TrajEvo outperforms previous heuristic methods on the ETH-UCY datasets, and remarkably outperforms both heuristics and deep learning methods when generalizing to the unseen SDD dataset. TrajEvo represents a first step toward automated design of fast, explainable, and generalizable trajectory prediction heuristics. We make our source code publicly available to foster future research at https://github.com/ai4co/trajevo.

摘要

轨迹预测是建模人类行为的关键任务，尤其在社交机器人和自动驾驶导航等领域。基于手工规则的传统启发式方法往往缺乏准确性，而近期提出的深度学习方法则存在计算成本高、可解释性不足以及泛化能力受限等问题，制约了其实际应用。本文提出TrajEvo框架，利用大语言模型（LLMs）自动设计轨迹预测启发式方法。该框架采用进化算法从历史轨迹数据中生成并优化预测启发式规则。我们提出跨代精英抽样策略以增强种群多样性，并建立统计反馈循环机制使LLM能够分析替代预测方案。评估结果表明，TrajEvo在ETH-UCY数据集上优于现有启发式方法，且在迁移至未见过的SDD数据集时，其表现显著超越启发式方法与深度学习方法。TrajEvo为快速、可解释且泛化性强的轨迹预测启发式方法的自动化设计迈出了第一步。我们已公开源代码以促进后续研究：https://github.com/ai4co/trajevo。
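
As a point of reference for what a "handcrafted heuristic" looks like, the sketch below implements the common constant-velocity baseline together with an average displacement error score. Both are assumptions chosen for illustration: the abstract does not say which baselines or metrics are used, and this is not a heuristic produced by TrajEvo.

```python
import math


def constant_velocity_predict(history, steps):
    """Extrapolate future (x, y) positions from the last observed displacement."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, steps + 1)]


def average_displacement_error(predicted, truth):
    """Mean Euclidean distance between predicted and true positions."""
    return sum(math.dist(p, t) for p, t in zip(predicted, truth)) / len(predicted)


# Hypothetical pedestrian track: walking along the x axis, then drifting upward.
observed = [(0, 0), (1, 0), (2, 0)]
future_truth = [(3, 0), (4, 1), (5, 2)]

prediction = constant_velocity_predict(observed, steps=3)
ade = average_displacement_error(prediction, future_truth)

print(prediction, ade)
```

An evolutionary search of the kind described would score many candidate functions like `constant_velocity_predict` with a metric of this sort and keep refining the better ones.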

Benchmarking LLMs' Swarm intelligence

Abstract

arXiv:2505.04364v1 Announce Type: new Abstract: Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict constraints (such as limited local perception and communication, characteristic of natural swarms) remains largely unexplored, particularly concerning the nuances of swarm intelligence. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (k x k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Evaluating several leading LLMs in a zero-shot setting, we find significant performance variations across tasks, highlighting the difficulties posed by local information constraints. While some coordination emerges, results indicate limitations in robust planning and strategy formation under uncertainty in these decentralized scenarios. Assessing LLMs under swarm-like conditions is crucial for realizing their potential in future decentralized systems. We release SwarmBench as an open, extensible toolkit built upon a customizable and scalable physical system with defined mechanical properties. It provides environments, prompts, evaluation scripts, and the comprehensive experimental datasets generated, aiming to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of Embodied MAS. Our code repository is available at https://github.com/x66ccff/swarmbench.

摘要

大语言模型（LLMs）在复杂推理方面展现出潜力，但其在多智能体系统（MAS）中面临严格约束（如自然群体特有的有限局部感知与通信）时，所表现出的涌现协调能力——尤其是群体智能的细微特征——仍亟待探索。现有基准测试往往未能充分体现智能体在时空信息不完整条件下进行分散式协调时产生的独特挑战。为此，我们提出SwarmBench：一个专为系统评估LLMs作为分散式智能体的群体智能能力而设计的新型基准测试。SwarmBench在可配置的2D网格环境中包含五项基础MAS协调任务，强制智能体主要依赖局部感官输入（k×k视野）和局部通信。我们提出了协调效能评估指标，并分析涌现的群体动态。通过对多个领先LLMs进行零样本评估，发现不同任务间存在显著性能差异，凸显了局部信息约束带来的挑战。虽然观察到部分协调行为，但结果表明这些分散场景下智能体在不确定性条件下的稳健规划与策略形成仍存在局限。在类群体条件下评估LLMs，对于实现其在未来分散式系统中的潜力至关重要。我们发布SwarmBench作为开放可扩展工具包——其基于具有明确力学特性的可定制化物理系统构建，提供环境配置、提示模板、评估脚本及完整实验数据集，旨在推动基于LLM的MAS协调与具身MAS理论基础的复现性研究。代码仓库详见https://github.com/x66ccff/swarmbench。
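
The "k x k view" constraint can be pictured with a small helper that cuts an agent-centered window out of a 2D grid. This is only a sketch under assumptions: the grid encoding (a list of strings), the padding symbol, and the function name are hypothetical and do not reflect SwarmBench's actual API.

```python
def local_view(grid, row, col, k, fill="#"):
    """Return the k x k window centered on (row, col); k is assumed to be odd."""
    half = k // 2
    view = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            # Cells beyond the map edge are padded with the fill symbol.
            line.append(grid[r][c] if inside else fill)
        view.append("".join(line))
    return view


# Hypothetical 3 x 3 world with three agents A, B, and C.
world = ["A..", ".B.", "..C"]

corner_view = local_view(world, 0, 0, 3)  # what agent A perceives
center_view = local_view(world, 1, 1, 3)  # what agent B perceives

print(corner_view)
print(center_view)
```

With a 3 x 3 view, agent A never sees agent C, so any coordination between them has to pass through local messages or through agent B.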

Abstract

arXiv:2505.03746v1 Announce Type: cross Abstract: Social media platforms enable instant and ubiquitous connectivity and are essential to social interaction and communication in our technological society. Despite their advantages, these platforms have given rise to negative behaviors in the online community, known as cyberbullying. Despite the many recent works involving generative Artificial Intelligence (AI) in the literature, there remain opportunities to study its performance beyond zero/few-shot learning strategies. Accordingly, we propose an innovative, real-time solution for cyberbullying detection that leverages stream-based Machine Learning (ML) models able to process incoming samples incrementally and Large Language Models (LLMs) for feature engineering to address the evolving nature of abusive and hate speech online. An explainability dashboard is provided to promote the system's trustworthiness, reliability, and accountability. Results on experimental data report promising performance close to 90% in all evaluation metrics, surpassing those obtained by competing works in the literature. Ultimately, our proposal contributes to the safety of online communities by detecting abusive behavior in a timely manner to prevent long-lasting harassment and reduce the negative consequences in society.

摘要

社交媒体平台实现了即时且无处不在的连接，在我们这个技术社会中对于社交互动和沟通至关重要。尽管有诸多优势，这些平台也催生了在线社区中的负面行为，即所谓的网络欺凌。尽管近来文献中已有许多涉及生成式人工智能（AI）的研究，但除了零样本/少样本学习策略外，其性能仍有待探索。为此，我们提出了一种创新的实时网络欺凌检测解决方案，该方案利用基于流的机器学习（ML）模型（能够增量处理输入样本）和大型语言模型（LLM）进行特征工程，以应对在线侮辱性和仇恨言论的演变特性。我们还提供了一个可解释性仪表盘，以提升系统的可信度、可靠性和可问责性。实验数据结果显示，所有评估指标均接近90%，性能优异，且超越了文献中同类工作的成果。最终，我们的方案通过及时检测侮辱性行为来防止长期骚扰并减少社会负面影响，从而为在线社区的安全做出贡献。

APSQ: Additive Partial Sum Quantization with Algorithm-Hardware Co-Design

Abstract

arXiv:2505.03748v1 Announce Type: cross Abstract: DNN accelerators have advanced considerably through model compression and specialized dataflow techniques. However, the frequent access of high-precision partial sums (PSUMs) leads to excessive memory demands in architectures utilizing input/weight stationary dataflows. Traditional compression strategies have typically overlooked PSUM quantization, which may account for 69% of power consumption. This study introduces a novel Additive Partial Sum Quantization (APSQ) method, seamlessly integrating PSUM accumulation into the quantization framework. A grouping strategy that combines APSQ with PSUM quantization enhanced by a reconfigurable architecture is further proposed. APSQ is nearly lossless on NLP and CV tasks across BERT, Segformer, and EfficientViT models while compressing PSUMs to INT8, reducing energy costs by 28-87%. Extended experiments on LLaMA2-7B demonstrate the potential of APSQ for large language models. Code is available at https://github.com/Yonghao-Tan/APSQ.

摘要

深度神经网络（DNN）加速器在模型压缩和专用数据流技术的推动下取得了显著进展。然而，在采用输入/权重静态数据流的架构中，高精度部分和（PSUM）的频繁访问导致内存需求过高。传统压缩策略通常忽视PSUM量化，而这一环节可能占据69%的功耗。本研究提出了一种新颖的加法部分和量化（APSQ）方法，将PSUM累加无缝集成至量化框架中。进一步提出了一种分组策略，将APSQ与可重构架构增强的PSUM量化相结合。实验表明，APSQ在BERT、Segformer和EfficientViT模型的自然语言处理与计算机视觉任务上实现了近乎无损的INT8精度PSUM压缩，同时将能耗显著降低28-87%。针对LLaMA2-7B的扩展实验验证了APSQ在大型语言模型中的应用潜力。代码已开源：https://github.com/Yonghao-Tan/APSQ。

Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management

Abstract

arXiv:2505.03756v1 Announce Type: cross Abstract: Multiple Low-Rank Adapters (Multi-LoRAs) are gaining popularity for task-specific Large Language Model (LLM) applications. For multi-LoRA serving, caching hot KV caches and LoRA adapters in the high bandwidth memory (HBM) of accelerators can improve inference performance. However, existing Multi-LoRA inference systems fail to optimize serving performance metrics such as Time-To-First-Token (TTFT), neglecting usage dependencies when caching LoRAs and KVs. We therefore propose FASTLIBRA, a Multi-LoRA caching system to optimize the serving performance. FASTLIBRA comprises a dependency-aware cache manager and a performance-driven cache swapper. The cache manager maintains the usage dependencies between LoRAs and KV caches during inference with a unified caching pool. The cache swapper determines the swap-in or swap-out of LoRAs and KV caches based on a unified cost model, when the HBM is idle or busy, respectively. Experimental results show that FASTLIBRA reduces the TTFT by 63.4% on average, compared to state-of-the-art works.

摘要

多低秩适配器（Multi-LoRAs）在任务特定的大语言模型（LLM）应用中日益普及。对于多LoRA服务，将热门的KV缓存和LoRA适配器缓存在加速器的高带宽内存中可以提高推理性能。然而，现有的多LoRA推理系统未能优化服务性能（如首次令牌时间，TTFT），在缓存LoRA和KV时忽略了使用依赖关系。因此，我们提出了FASTLIBRA，一种多LoRA缓存系统，以优化服务性能。FASTLIBRA包括一个依赖感知的缓存管理器和一个性能驱动的缓存交换器。缓存管理器在推理过程中通过统一的缓存池维护LoRA和KV缓存之间的使用依赖关系。缓存交换器基于统一的成本模型，分别在HBM空闲或繁忙时决定LoRA和KV缓存的换入或换出。实验结果表明，与最先进的工作相比，ELORA平均将TTFT降低了63.4%。

AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design

Abstract

arXiv:2505.03745v1 Announce Type: cross Abstract: Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on resource-constrained edge devices poses significant challenges, including (1) intensive computations and huge model sizes, (2) great memory and bandwidth demands introduced by the autoregressive generation process, and (3) limited scalability for handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference through algorithm and hardware co-design. At the algorithmic level, we integrate (1) pruning, (2) Λ-shaped attention, and (3) an innovative W2A8KV4 (2-bit weights, 8-bit activations, and 4-bit KV cache) quantization scheme, thus effectively reducing memory and bandwidth requirements while facilitating LLMs' long-sequence generation. At the hardware level, we design a dedicated FPGA-based accelerator with a reconfigurable computing engine to effectively and flexibly accommodate diverse operations arising from our compression algorithm, thereby fully translating the algorithmic innovations into tangible hardware efficiency. We validate AccLLM on the Xilinx Alveo U280 FPGA, demonstrating a 4.07x energy efficiency and a 2.98x throughput compared to the state-of-the-art work FlightLLM.

摘要

近年来，大语言模型（LLMs）在自然语言处理（NLP）领域取得了巨大成功，推动了将其部署从云端扩展到边缘设备的迫切需求。然而，在资源受限的边缘设备上部署LLMs面临重大挑战，包括：（1）密集的计算和庞大的模型规模，（2）自回归生成过程带来的高内存和带宽需求，以及（3）处理长序列时的有限可扩展性。为应对这些挑战，我们提出AccLLM，一种通过算法与硬件协同设计的综合加速框架，实现高效快速的长上下文LLM推理。在算法层面，我们整合了（1）剪枝，（2）Λ形注意力机制，以及（3）创新的W2A8KV4（2位权重、8位激活和4位KV缓存）量化方案，从而有效降低内存和带宽需求，同时提升LLMs的长序列生成能力。在硬件层面，我们设计了一款基于FPGA的专用加速器，配备可重构计算引擎，以高效灵活地适配压缩算法产生的多样化操作，从而将算法创新充分转化为实际的硬件效能。我们在Xilinx Alveo U280 FPGA上验证了AccLLM，相较于最先进的工作FlightLLM，能效提升4.07倍，吞吐量提高2.98倍。
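
Λ-shaped attention is usually described as letting each token attend to a few initial tokens plus a sliding window of the most recent ones, which caps the number of keys per query regardless of sequence length. The sketch below builds such a mask under that common description. The parameter names `n_sink` and `window` and their values are assumptions, since the abstract does not give AccLLM's configuration.

```python
def lambda_mask(seq_len, n_sink, window):
    """Boolean mask: mask[q][k] is True if query q may attend to key k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q               # never attend to future tokens
            is_initial = k < n_sink       # the first few tokens stay visible
            is_recent = q - k < window    # sliding window of recent tokens
            row.append(causal and (is_initial or is_recent))
        mask.append(row)
    return mask


# Hypothetical sizes: 6 tokens, 1 initial token kept, window of 2 recent tokens.
mask = lambda_mask(seq_len=6, n_sink=1, window=2)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Printed row by row, the allowed positions form the two strokes of a Λ: one along the first column and one along the diagonal. The number of keys retained per query never exceeds `n_sink + window`, which is what keeps the KV cache bounded for long sequences.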

GPU Performance Portability needs Autotuning

Abstract

arXiv:2505.03780v1 Announce Type: cross Abstract: As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today's reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art performance LLM execution without code changes. Focusing on flash attention -- a widespread performance-critical LLM kernel -- we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.

摘要

随着大型语言模型(LLM)复杂度不断提升，要实现最先进性能需要在算法、软件和硬件之间进行紧密协同设计。当前对单一主导平台的依赖限制了可移植性，造成供应商锁定，并抬高了新型AI硬件的准入门槛。本研究提出将即时(JIT)编译与内核参数自动调优相结合，无需修改代码即可实现可移植的、最先进性能的LLM执行。以广泛使用的性能关键型LLM内核——闪存注意力机制为例，我们证明该方法可探索多达15倍的参数配置组合，在多个维度上生成显著更多样化的代码，甚至能以最高230%的优势超越供应商优化实现，同时将内核代码量减少70倍并消除手工代码优化。研究结果表明，自动调优是解锁跨GPU供应商模型可移植性的一条重要途径。
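
The core loop of kernel parameter autotuning is an exhaustive or guided search over configurations, keeping whichever one measures fastest on the target device. The sketch below shows the exhaustive variant with a stand-in cost function in place of a real kernel benchmark. The parameter names (`block_m`, `block_n`, `num_warps`) and the cost formula are hypothetical and do not come from the paper.

```python
import itertools


def autotune(cost_fn, search_space):
    """Try every configuration in the search space and return the cheapest one."""
    names = list(search_space)
    best_cfg, best_cost = None, float("inf")
    for values in itertools.product(*(search_space[name] for name in names)):
        cfg = dict(zip(names, values))
        cost = cost_fn(cfg)  # in practice: JIT-compile the kernel and time it
        if cost < best_cost:
            best_cfg, best_cost = cfg, cost
    return best_cfg, best_cost


def toy_cost(cfg):
    """Hypothetical stand-in for a measured kernel runtime on one GPU."""
    return abs(cfg["block_m"] - 64) + abs(cfg["block_n"] - 128) + 4 * cfg["num_warps"]


space = {"block_m": [32, 64, 128], "block_n": [64, 128], "num_warps": [4, 8]}
best_cfg, best_cost = autotune(toy_cost, space)

print(best_cfg, best_cost)
```

Because the best configuration depends on `cost_fn`, running the same search on a different GPU can select different parameters without touching the kernel source, which is the portability argument the abstract makes.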

Splitwiser: Efficient LM inference with constrained resources

Abstract

arXiv:2505.03763v1 Announce Type: cross Abstract: Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail to fully utilize compute resources, especially when compared to prompt computation phases. To address these challenges, we propose Splitwiser, a methodology that splits the two phases of an LLM inference request onto the same GPU, thereby reducing overhead and improving memory access and cache utilization. By eliminating the need to transfer data across devices, Splitwiser aims to minimize network-related overheads. In this report, we describe the basic structure of our proposed pipeline while sharing preliminary results and analysis. We implement our proposed multiprocessing design on two widely-used and independent LLM architectures: Huggingface and vLLM. We open-source our code for the respective implementations: 1) Huggingface (https://github.com/asad-aali/splitwiser), and 2) vLLM (https://github.com/adney11/vllm-sysml).

摘要

大型语言模型（LLM）的高效推理仍面临关键挑战，其包含两个主要阶段：计算密集型的提示计算阶段和内存密集型的令牌生成阶段。尽管现有批处理与调度技术已取得进展，但令牌生成阶段的计算资源利用率仍不足，尤其在对比提示计算阶段时表现明显。为解决这些问题，我们提出Splitwiser方法，该方法将LLM推理请求的两个阶段拆分至同一GPU上执行，从而降低开销并提升内存访问与缓存利用率。通过消除跨设备数据传输需求，Splitwiser旨在最小化网络相关开销。本报告阐述了所提出流水线的基本结构，并分享了初步实验结果与分析。我们在两种广泛使用且独立的LLM架构（Huggingface与vLLM）上实现了该多进程设计方案，相关代码已开源：1）Huggingface实现（https://github.com/asad-aali/splitwiser）；2）vLLM实现（https://github.com/adney11/vllm-sysml）。

Abstract

arXiv:2505.03788v1 Announce Type: cross Abstract: We introduce a novel approach for calibrating uncertainty quantification (UQ) tailored for multi-modal large language models (LLMs). Existing state-of-the-art UQ methods rely on consistency among multiple responses generated by the LLM on an input query under diverse settings. However, these approaches often report higher confidence in scenarios where the LLM is consistently incorrect. This leads to a poorly calibrated confidence with respect to accuracy. To address this, we leverage cross-modal consistency in addition to self-consistency to improve the calibration of the multi-modal models. Specifically, we ground the textual responses to the visual inputs. The confidence from the grounding model is used to calibrate the overall confidence. Given that using a grounding model adds its own uncertainty in the pipeline, we apply temperature scaling - a widely accepted parametric calibration technique - to calibrate the grounding model's confidence in the accuracy of generated responses. We evaluate the proposed approach across multiple multi-modal tasks, such as medical question answering (Slake) and visual question answering (VQAv2), considering multi-modal models such as LLaVA-Med and LLaVA. The experiments demonstrate that the proposed framework achieves significantly improved calibration on both tasks.

摘要

我们提出了一种针对多模态大语言模型（LLM）不确定性量化（UQ）校准的新方法。现有最先进的UQ方法依赖于LLM在不同设置下对输入查询生成多个响应的一致性。然而，这些方法在LLM持续出错的场景中往往会报告更高的置信度，导致置信度与准确率之间的校准效果不佳。为解决这一问题，我们在自一致性基础上引入跨模态一致性来改进多模态模型的校准。具体而言，我们将文本响应锚定于视觉输入，利用锚定模型的置信度来校准整体置信度。鉴于使用锚定模型会在流程中引入其自身的不确定性，我们采用温度缩放（一种广泛接受的参数化校准技术）来校准锚定模型对生成响应准确性的置信度。我们在多个多模态任务（如医学问答Slake和视觉问答VQAv2）上评估所提方法，测试模型包括LLaVA-Med和LLaVA等多模态模型。实验表明，该框架在两项任务上均实现了显著改进的校准效果。
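
Temperature scaling, mentioned above as the calibration step, divides a model's logits by a single scalar T fitted on held-out data so that confidence better tracks accuracy. The sketch below fits T by a simple grid search over negative log-likelihood. The validation logits, labels, and candidate grid are hypothetical, and grid search is a simplification of the gradient-based fitting typically used.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Hypothetical validation set: the model is ~99% confident but only 75% accurate.
val_logits = [[5.0, 0.0], [5.0, 0.0], [5.0, 0.0], [5.0, 0.0]]
val_labels = [0, 0, 0, 1]

best_t = fit_temperature(val_logits, val_labels, [0.5, 1.0, 2.0, 4.0, 8.0])
before = softmax(val_logits[0])[0]
after = softmax(val_logits[0], best_t)[0]

print(best_t, round(before, 3), round(after, 3))
```

A fitted temperature above 1 softens the overconfident predictions. The predicted class is unchanged, since dividing all logits by a positive constant preserves their order.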

Large Language Model Compression with Global Rank and Sparsity Optimization

Abstract

arXiv:2505.03801v1 Announce Type: cross Abstract: Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of existing methods. The first challenge relates to the interaction and cooperation between low-rank and sparse matrices, while the second involves determining weight allocation across different layers, as redundancy varies considerably among them. To address these challenges, we propose a novel two-stage LLM compression method with the capability of global rank and sparsity optimization. It is noteworthy that the overall optimization space is vast, making comprehensive optimization computationally prohibitive. Therefore, to reduce the optimization space, our first stage utilizes robust principal component analysis to decompose the weight matrices of LLMs into low-rank and sparse components, which span the low dimensional and sparse spaces containing the resultant low-rank and sparse matrices, respectively. In the second stage, we propose a probabilistic global optimization technique to jointly identify the low-rank and sparse structures within the above two spaces. The appealing feature of our approach is its ability to automatically detect the redundancy across different layers and to manage the interaction between the sparse and low-rank components. Extensive experimental results indicate that our method significantly surpasses state-of-the-art techniques for sparsification and composite approximation.

摘要

低秩与稀疏复合近似是压缩大语言模型（LLM）的自然思路。然而，该方法面临两大核心挑战，严重影响现有技术的性能：其一涉及低秩矩阵与稀疏矩阵的交互协作问题，其二在于不同网络层的权重分配策略，因其冗余度存在显著差异。针对这些挑战，我们提出一种具备全局秩与稀疏度优化能力的新型两阶段LLM压缩方法。值得注意的是，整体优化空间极为庞大，使得全局优化在计算上难以实现。为此，第一阶段采用鲁棒主成分分析将LLM权重矩阵分解为低秩分量与稀疏分量，二者分别生成包含结果矩阵的低维空间与稀疏空间。第二阶段提出概率化全局优化技术，在上述双空间中联合识别低秩与稀疏结构。本方法的突出优势在于能自动检测不同层级的冗余特征，并有效协调稀疏分量与低秩分量的相互作用。大量实验结果表明，该方法在稀疏化与复合近似任务上显著优于当前最先进技术。

LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection

Abstract

arXiv:2505.03793v1 Announce Type: cross Abstract: The proliferation of open-sourced Large Language Models (LLMs) and diverse downstream tasks necessitates efficient model selection, given the impracticality of fine-tuning all candidates due to computational constraints. Despite the recent advances in LLM selection, a fundamental research question remains largely unexplored: how can we model the dynamic behaviors of LLMs during fine-tuning, thereby enhancing our understanding of their generalization performance across diverse downstream tasks? In this work, we propose a novel theoretical framework that provides a proper lens to assess the generalization capabilities of LLMs, thereby enabling accurate and efficient LLM selection for downstream applications. In particular, we first derive a Hessian-based PAC-Bayes generalization bound that unveils fine-tuning dynamics of LLMs and then introduce LENSLLM, a Neural Tangent Kernel (NTK)-based Rectified Scaling Model that enables accurate performance predictions across diverse tasks while maintaining computational efficiency. Extensive empirical results on 3 large-scale benchmarks demonstrate that our model achieves up to 91.1% accuracy and reduces up to 88.5% computational cost in LLM selection, outperforming 5 state-of-the-art methods. We open-source our proposed LENSLLM model and corresponding results at the GitHub link: https://github.com/Susan571/LENSLLM.git.

摘要

随着开源大型语言模型（LLMs）的激增和下游任务的多样化，在计算资源受限导致无法对所有候选模型进行微调的情况下，高效的模型选择变得至关重要。尽管近期LLM选择研究取得了进展，但一个核心科学问题仍处于萌芽阶段：如何建模LLMs在微调过程中的动态行为，从而深化我们对其在不同下游任务中泛化性能的理解？本研究提出了一种新颖的理论框架，为评估LLMs的泛化能力提供了有效视角，从而实现下游应用中精准高效的LLM选择。具体而言，我们首先推导出基于Hessian矩阵的PAC-Bayes泛化界，揭示了LLMs的微调动态特性；继而提出LENSLLM模型——一种基于神经正切核（NTK）的修正缩放模型，该模型能在保持计算效率的同时，精准预测跨任务性能表现。在3个大规模基准测试上的实验结果表明，我们的模型在LLM选择中最高可达91.1%的准确率，并降低88.5%的计算成本，性能优于5种最先进方法。我们已将LENSLLM模型及相关成果开源，GitHub链接：https://github.com/Susan571/LENSLLM.git。

Efficient Fine-Tuning of Quantized Models via Adaptive Rank and Bitwidth

Abstract

arXiv:2505.03802v1 Announce Type: cross Abstract: QLoRA effectively combines low-bit quantization and LoRA to achieve memory-friendly fine-tuning for large language models (LLM). Recently, methods based on SVD for continuous update iterations to initialize LoRA matrices to accommodate quantization errors have generally failed to consistently improve performance. Dynamic mixed precision is a natural idea for continuously improving the fine-tuning performance of quantized models, but previous methods often optimize low-rank subspaces or quantization components separately, without considering their synergy. To address this, we propose QR-Adaptor, a unified, gradient-free strategy that uses partial calibration data to jointly search the quantization components and the rank of low-rank spaces for each layer, thereby continuously improving model performance. QR-Adaptor does not minimize quantization error but treats precision and rank allocation as a discrete optimization problem guided by actual downstream performance and memory usage. Compared to state-of-the-art (SOTA) quantized LoRA fine-tuning methods, our approach achieves a 4.89% accuracy improvement on GSM8K, and in some cases even outperforms the 16-bit fine-tuned model while maintaining the memory footprint of the 4-bit setting.

摘要

QLoRA通过有效结合低位数量化和LoRA技术，实现了对大语言模型（LLM）的内存友好型微调。近期基于SVD的连续更新迭代方法虽尝试通过初始化LoRA矩阵来适应量化误差，但普遍未能持续提升性能。动态混合精度是持续改进量化模型微调性能的自然思路，但现有方法往往分别优化低秩子空间或量化组件，未考虑二者的协同效应。为此，我们提出QR-Adaptor，这是一种无需梯度的统一策略，利用部分校准数据联合搜索每层的量化组件和低秩空间秩数，从而持续提升模型性能。该方法不最小化量化误差，而是将精度与秩分配视为受实际下游性能和内存使用指导的离散优化问题。相比最先进（SOTA）的量化LoRA微调方法，我们的方案在GSM8K上实现了4.89%的准确率提升，某些情况下甚至优于16位微调模型，同时保持4位设置的内存占用。

RWKVQuant: Quantizing the RWKV Family with Proxy Guided Hybrid of Scalar and Vector Quantization

Abstract

arXiv:2505.03803v1 Announce Type: cross Abstract: RWKV is a modern RNN architecture with comparable performance to Transformer, but still faces challenges when deployed to resource-constrained devices. Post Training Quantization (PTQ), which is an essential technique to reduce model size and inference latency, has been widely used in Transformer models. However, it suffers significant degradation of performance when applied to RWKV. This paper investigates and identifies two key constraints inherent in the properties of RWKV: (1) Non-linear operators hinder the parameter-fusion of both smooth- and rotation-based quantization, introducing extra computation overhead. (2) The larger amount of uniformly distributed weights poses challenges for cluster-based quantization, leading to reduced accuracy. To this end, we propose RWKVQuant, a PTQ framework tailored for RWKV models, consisting of two novel techniques: (1) a coarse-to-fine proxy capable of adaptively selecting different quantization approaches by assessing the uniformity and identifying outliers in the weights, and (2) a codebook optimization algorithm that enhances the performance of cluster-based quantization methods for element-wise multiplication in RWKV. Experiments show that RWKVQuant can quantize RWKV-6-14B into about 3-bit with less than 1% accuracy loss and 2.14x speed up.

摘要

RWKV是一种性能与Transformer相当的现代循环神经网络架构，但在部署到资源受限设备时仍面临挑战。训练后量化（PTQ）作为减小模型规模和降低推理延迟的关键技术，已在Transformer模型中广泛应用。然而，该方法应用于RWKV时会出现显著的性能下降。本文通过研究揭示了RWKV固有特性的两个关键制约因素：（1）非线性算子阻碍了基于平滑和旋转量化的参数融合，引入了额外计算开销；（2）大量均匀分布的权重对基于聚类的量化方法构成挑战，导致精度下降。为此，我们提出RWKVQuant——专为RWKV模型设计的PTQ框架，包含两项创新技术：（1）通过评估权重均匀性并识别离群值，能自适应选择不同量化方法的粗细粒度代理；（2）针对RWKV中逐元素乘法运算，提升基于聚类的量化方法性能的码本优化算法。实验表明，RWKVQuant可将RWKV-6-14B量化为约3比特，在精度损失小于1%的同时实现2.14倍的加速。

Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling

Abstract

arXiv:2505.03799v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in various natural language processing tasks; however, their application to graph-related problems remains limited, primarily due to scalability constraints and the absence of dedicated mechanisms for processing graph structures. Existing approaches predominantly integrate LLMs with Graph Neural Networks (GNNs), using GNNs as feature encoders or auxiliary components. However, directly encoding graph structures within LLMs has been underexplored, particularly in the context of large-scale graphs where token limitations hinder effective representation. To address these challenges, we propose SDM-InstructGLM, a novel instruction-tuned Graph Language Model (InstructGLM) framework that enhances scalability and efficiency without relying on GNNs. Our method introduces a similarity-degree-based biased random walk mechanism, which selectively samples and encodes graph information based on node-feature similarity and degree centrality, ensuring an adaptive and structured representation within the LLM. This approach significantly improves token efficiency, mitigates information loss due to random sampling, and enhances performance on graph-based tasks such as node classification and link prediction. Furthermore, our results demonstrate the feasibility of LLM-only graph processing, enabling scalable and interpretable Graph Language Models (GLMs) optimized through instruction-based fine-tuning. This work paves the way for GNN-free approaches to graph learning, leveraging LLMs as standalone graph reasoning models. Our source code is available on GitHub.

摘要

大语言模型（LLMs）在各种自然语言处理任务中展现出强大能力，但其在图相关问题的应用仍受限于可扩展性约束及缺乏专门处理图结构的机制。现有方法主要将LLMs与图神经网络（GNNs）结合，以GNN作为特征编码器或辅助组件。然而，直接在LLMs中编码图结构的研究尚未深入，尤其在大规模图场景下，标记限制阻碍了有效表征。为解决这些挑战，我们提出SDM-InstructGLM——一种新型指令调优图语言模型（InstructGLM）框架，该框架在不依赖GNN的情况下提升可扩展性与效率。我们的方法引入基于相似度-度中心性的偏置随机游走机制，通过节点特征相似性和度中心性选择性采样并编码图信息，确保LLM内形成自适应结构化表征。该方法显著提升标记效率，缓解随机采样导致的信息损失，并增强节点分类和链接预测等图任务的性能。此外，实验结果验证了纯LLM图处理的可行性，通过基于指令的微调实现可扩展且可解释的图语言模型（GLMs）。本工作为免GNN的图学习开辟了新路径，推动LLMs作为独立图推理模型的应用。源代码已发布于GitHub。
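
A biased random walk of the kind this abstract describes can be sketched as follows. The abstract does not give the actual transition rule, so the weighting used here (a convex mix of cosine feature similarity and normalised neighbour degree, controlled by a hypothetical `alpha`) is an assumption for illustration only, as are the function names and the dictionary-based graph representation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def biased_walk(adj, feats, start, length, alpha=0.5, seed=0):
    """Sample a walk that prefers similar and well-connected neighbours.

    adj   : dict mapping node -> list of neighbour nodes
    feats : dict mapping node -> feature vector (list of floats)
    alpha : assumed mixing weight between similarity and degree
    """
    rng = random.Random(seed)
    max_deg = max(len(nbrs) for nbrs in adj.values())
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:  # dead end: stop early
            break
        weights = [
            alpha * max(cosine(feats[cur], feats[n]), 0.0)
            + (1.0 - alpha) * len(adj[n]) / max_deg
            + 1e-9  # keeps every weight strictly positive
            for n in nbrs
        ]
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

The sampled node sequence would then be serialised into the prompt, so a shorter walk that still covers informative neighbours directly reduces the token count.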

Grouped Sequency-arranged Rotation: Optimizing Rotation Transformation for Quantization for Free

Abstract

arXiv:2505.03810v1 Announce Type: cross Abstract: Large Language Models (LLMs) face deployment challenges due to high computational costs, and while Post-Training Quantization (PTQ) offers a solution, existing rotation-based methods struggle at very low bit-widths like 2-bit. We introduce a novel, training-free approach to construct an improved rotation matrix, addressing the limitations of current methods. The key contributions include leveraging the Walsh-Hadamard transform with sequency ordering, which clusters similar frequency components to reduce quantization error compared to standard Hadamard matrices, significantly improving performance. Furthermore, we propose a Grouped Sequency-arranged Rotation (GSR) using block-diagonal matrices with smaller Walsh blocks, effectively isolating outlier impacts and achieving performance comparable to optimization-based methods without requiring any training. Our method demonstrates robust performance on reasoning tasks and Perplexity (PPL) score on WikiText-2. Our method also enhances results even when applied over existing learned rotation techniques.

摘要

大型语言模型（LLMs）因高昂的计算成本面临部署挑战，而后训练量化（PTQ）虽提供解决方案，但现有基于旋转的方法在极低比特位宽（如2比特）下表现欠佳。我们提出一种无需训练的新方法，通过构建改进的旋转矩阵来解决现有技术的局限性。核心创新包括：采用按序数排列的沃尔什-哈达玛变换，相较于标准哈达玛矩阵，该变换能聚类相似频率分量以降低量化误差，从而显著提升性能；进一步提出分组序数排列旋转（GSR），利用包含小型沃尔什矩阵块的块对角矩阵，有效隔离异常值影响，在不依赖任何训练的情况下实现与基于优化的方法相媲美的性能。我们的方法在推理任务和WikiText-2数据集上的困惑度（PPL）指标均表现出鲁棒性能，即使应用于现有学习型旋转技术之上仍能提升效果。
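
The two ingredients named in this abstract, a sequency-ordered Walsh matrix and a block-diagonal rotation built from smaller Walsh blocks, can be sketched in a few lines. This is an illustrative construction, not the authors' implementation; the function names and the choice of block size are assumptions. Sequency here means the number of sign changes along a row.

```python
import numpy as np


def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H


def sign_changes(row):
    """Number of sign flips along a row (its sequency)."""
    return int(np.count_nonzero(np.diff(np.sign(row))))


def walsh_sequency(n):
    """Orthonormal Walsh matrix with rows sorted by increasing sequency."""
    H = hadamard(n)
    order = sorted(range(n), key=lambda i: sign_changes(H[i]))
    return H[order] / np.sqrt(n)


def grouped_rotation(dim, block):
    """Block-diagonal rotation made of identical sequency-ordered Walsh blocks."""
    if dim % block:
        raise ValueError("dim must be a multiple of block")
    W = walsh_sequency(block)
    R = np.zeros((dim, dim))
    for start in range(0, dim, block):
        R[start:start + block, start:start + block] = W
    return R
```

Because the rotation is block-diagonal, an outlier in one group of channels is mixed only within that group, which is the isolation effect the abstract attributes to GSR.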

Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs

Abstract

arXiv:2505.03814v1 Announce Type: cross Abstract: As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the performance of large language models (LLMs). However, there is a lack of systematic analysis and guidance on determining the sufficiency of test data or selecting informative samples for evaluation. This paper introduces a certifiable and cost-efficient evaluation framework for LLMs. Our framework adapts to different evaluation objectives and outputs confidence intervals that contain true values with high probability. We use "test sample complexity" to quantify the number of test points needed for a certifiable evaluation and derive tight bounds on test sample complexity. Based on the developed theory, we develop a partition-based algorithm, named Cer-Eval, that adaptively selects test points to minimize the cost of LLM evaluation. Real-world experiments demonstrate that Cer-Eval can save 20% to 40% test points across various benchmarks, while maintaining an estimation error level comparable to the current evaluation process and providing a 95% confidence guarantee.

摘要

随着基础模型规模持续扩大，训练模型的体量呈指数级增长，这为其评估带来了重大挑战。当前评估实践通过构建日益庞大的数据集来评估大语言模型（LLMs）的性能。然而，在确定测试数据充分性或选择信息性评估样本方面，尚缺乏系统性分析和指导。本文提出一种可验证且高性价比的LLM评估框架。该框架能适应不同评估目标，并以高概率输出包含真实值的置信区间。我们采用"测试样本复杂度"来量化可验证评估所需的测试点数量，并推导出测试样本复杂度的紧致边界。基于所建立的理论，我们开发了一种基于分区的算法Cer-Eval，该算法能自适应选择测试点以最小化LLM评估成本。实际实验表明，Cer-Eval在各类基准测试中可节省20%至40%的测试点，同时保持与当前评估流程相当的估计误差水平，并提供95%的置信度保证。
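
To make "test sample complexity" concrete, the sketch below computes the non-adaptive textbook baseline from Hoeffding's inequality: for a per-example score bounded in [0, 1], n >= ln(2/delta) / (2 * epsilon^2) test points suffice for the empirical mean to lie within epsilon of the true mean with probability at least 1 - delta. This is not the Cer-Eval algorithm, which selects points adaptively via partitioning; the function names are assumptions.

```python
import math


def hoeffding_sample_size(epsilon, delta):
    """Test points needed so |empirical mean - true mean| <= epsilon w.p. >= 1 - delta."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def hoeffding_interval(scores, delta):
    """Two-sided (1 - delta) confidence interval for the mean of scores in [0, 1]."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one score")
    mean = sum(scores) / n
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return max(0.0, mean - half_width), min(1.0, mean + half_width)
```

For example, `hoeffding_sample_size(0.05, 0.05)` gives 738 test points for a 95% guarantee at a 5-point error margin. An adaptive method aims to reach the same guarantee with fewer points.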

MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance

Abstract

arXiv:2505.03804v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) large language models (LLMs), which leverage dynamic routing and sparse activation to enhance efficiency and scalability, have achieved higher performance while reducing computational costs. However, these models face significant memory overheads, limiting their practical deployment and broader adoption. Post-training quantization (PTQ), a widely used method for compressing LLMs, encounters severe accuracy degradation and diminished generalization performance when applied to MoE models. This paper investigates the impact of MoE's sparse and dynamic characteristics on quantization and identifies two primary challenges: (1) Inter-expert imbalance, referring to the uneven distribution of samples across experts, which leads to insufficient and biased calibration for less frequently utilized experts; (2) Intra-expert imbalance, arising from MoE's unique aggregation mechanism, which leads to varying degrees of correlation between different samples and their assigned experts. To address these challenges, we propose MoEQuant, a novel quantization framework tailored for MoE LLMs. MoEQuant includes two novel techniques: (1) Expert-Balanced Self-Sampling (EBSS), a sampling method that efficiently constructs a calibration set with balanced expert distributions by leveraging the cumulative probabilities of tokens and expert balance metrics as guiding factors; (2) Affinity-Guided Quantization (AGQ), which incorporates affinities between experts and samples into the quantization process, thereby accurately assessing the impact of individual samples on different experts within the MoE layer. Experiments demonstrate that MoEQuant achieves substantial performance gains (more than 10 points accuracy gain in the HumanEval for DeepSeekMoE-16B under 4-bit quantization) and boosts efficiency.

摘要

专家混合（Mixture-of-Experts, MoE）大语言模型通过动态路由和稀疏激活机制提升效率与可扩展性，在降低计算成本的同时实现了更高性能。然而，此类模型面临显著的内存开销问题，限制了其实际部署与广泛应用。后训练量化（PTQ）作为大语言模型压缩的常用方法，在应用于MoE模型时会出现严重的精度下降与泛化性能衰减。本文研究了MoE的稀疏动态特性对量化的影响，发现两大核心挑战：（1）专家间不平衡，即样本在专家间分布不均，导致低频使用专家的校准不足且存在偏差；（2）专家内不平衡，源于MoE独特的聚合机制，使得不同样本与其分配专家间的关联程度存在差异。针对这些问题，我们提出专为MoE大语言模型设计的量化框架MoEQuant，其包含两项创新技术：1）专家平衡自采样（EBSS），通过利用词元累积概率和专家平衡指标作为引导因子，高效构建具有均衡专家分布的校准集；2）亲和力引导量化（AGQ），将专家与样本间的亲和关系纳入量化过程，从而精准评估单个样本对MoE层内不同专家的影响。实验表明，MoEQuant在4比特量化下为DeepSeekMoE-16B模型带来显著性能提升（HumanEval基准准确率增益超10分），同时有效提升效率。

Program Semantic Inequivalence Game with Large Language Models

Abstract

arXiv:2505.03818v1 Announce Type: cross Abstract: Large Language Models (LLMs) can achieve strong performance on everyday coding tasks, but they can fail on complex tasks that require non-trivial reasoning about program semantics. Finding training examples to teach LLMs to solve these tasks can be challenging. In this work, we explore a method to synthetically generate code reasoning training data based on a semantic inequivalence game SInQ: a generator agent creates program variants that are semantically distinct, derived from a dataset of real-world programming tasks, while an evaluator agent has to identify input examples that cause the original programs and the generated variants to diverge in their behaviour, with the agents training each other semi-adversarially. We prove that this setup enables theoretically unlimited improvement through self-play in the limit of infinite computational resources. We evaluated our approach on multiple code generation and understanding benchmarks, including cross-language vulnerability detection (Lu et al., 2021), where our method improves vulnerability detection in C/C++ code despite being trained exclusively on Python code, and the challenging Python builtin identifier swap benchmark (Miceli-Barone et al., 2023), showing that whereas modern LLMs still struggle with this benchmark, our approach yields substantial improvements. We release the code needed to replicate the experiments, as well as the generated synthetic data, which can be used to fine-tune LLMs.

摘要

大语言模型（LLMs）在日常编码任务中表现优异，但在需要复杂程序语义推理的任务上可能失效。寻找合适的训练样本来教导LLMs解决这类任务具有挑战性。本研究探索了一种基于语义不等价游戏SInQ的合成代码推理训练数据生成方法：生成器代理从真实编程任务数据集中创建语义不同的程序变体，而评估器代理则需识别导致原始程序与生成变体行为差异的输入示例，二者通过半对抗方式相互训练。我们证明在无限计算资源的理论极限下，这种设置可通过自我博弈实现无限制的性能提升。我们在多个代码生成与理解基准上评估了该方法，包括跨语言漏洞检测（Lu等，2021）——尽管仅使用Python代码训练，我们的方法仍提升了C/C++代码的漏洞检测能力；以及具有挑战性的Python内置标识符替换基准（Miceli-Barone等，2023），结果表明现代LLMs仍难以应对该基准，而我们的方法带来了显著改进。我们公开了实验复现代码及生成的合成数据，这些数据可用于微调LLMs。

VideoLLM Benchmarks and Evaluation: A Survey

Abstract

arXiv:2505.03829v1 Announce Type: cross Abstract: The rapid development of Large Language Models (LLMs) has catalyzed significant advancements in video understanding technologies. This survey provides a comprehensive analysis of benchmarks and evaluation methodologies specifically designed or used for Video Large Language Models (VideoLLMs). We examine the current landscape of video understanding benchmarks, discussing their characteristics, evaluation protocols, and limitations. The paper analyzes various evaluation methodologies, including closed-set, open-set, and specialized evaluations for temporal and spatiotemporal understanding tasks. We highlight the performance trends of state-of-the-art VideoLLMs across these benchmarks and identify key challenges in current evaluation frameworks. Additionally, we propose future research directions to enhance benchmark design, evaluation metrics, and protocols, including the need for more diverse, multimodal, and interpretability-focused benchmarks. This survey aims to equip researchers with a structured understanding of how to effectively evaluate VideoLLMs and identify promising avenues for advancing the field of video understanding with large language models.

摘要

大型语言模型（LLMs）的快速发展显著推动了视频理解技术的进步。本综述对专为视频大语言模型（VideoLLMs）设计或采用的基准测试与评估方法进行了全面分析。我们系统考察了当前视频理解基准测试的现状，探讨了其特性、评估方案及局限性。文章剖析了多种评估方法，包括闭集评估、开集评估以及针对时序与时空理解任务的专业化评估。我们重点展示了前沿VideoLLMs在这些基准测试中的性能趋势，并指出现有评估框架的关键挑战。此外，我们提出了提升基准设计、评估指标与协议的未来研究方向，包括对更具多样性、多模态性和可解释性导向的基准测试的需求。本综述旨在为研究者提供结构化认知，帮助其有效评估VideoLLMs，并指明利用大语言模型推进视频理解领域的潜在发展路径。

Memory Assisted LLM for Personalized Recommendation System

Abstract

arXiv:2505.03824v1 Announce Type: cross Abstract: Large language models (LLMs) have demonstrated significant potential in solving recommendation tasks. With proven capabilities in understanding user preferences, LLM personalization has emerged as a critical area for providing tailored responses to individuals. Current studies explore personalization through prompt design and fine-tuning, paving the way for further research in personalized LLMs. However, existing approaches are either costly and inefficient in capturing diverse user preferences or fail to account for timely updates to user history. To address these gaps, we propose the Memory-Assisted Personalized LLM (MAP). Through user interactions, we first create a history profile for each user, capturing their preferences, such as ratings for historical items. During recommendation, we extract relevant memory based on similarity, which is then incorporated into the prompts to enhance personalized recommendations. In our experiments, we evaluate MAP using a sequential rating prediction task under two scenarios: single domain, where memory and tasks are from the same category (e.g., movies), and cross-domain (e.g., memory from movies and recommendation tasks in books). The results show that MAP outperforms regular LLM-based recommenders that integrate user history directly through prompt design. Moreover, as user history grows, MAP's advantage increases in both scenarios, making it more suitable for addressing successive personalized user requests.

摘要

大语言模型（LLMs）在解决推荐任务方面展现出显著潜力。随着理解用户偏好能力的验证，LLM个性化已成为向个体提供定制化响应的关键研究领域。当前研究通过提示设计和微调探索个性化路径，为个性化LLMs的深入研究奠定了基础。然而，现有方法要么在捕捉多样化用户偏好时成本高昂且效率低下，要么未能考虑用户历史的及时更新。为弥补这些不足，我们提出记忆辅助个性化大语言模型（MAP）。通过用户交互，我们首先为每个用户创建历史档案，记录其偏好（如对历史项目的评分）。在推荐过程中，我们基于相似性提取相关记忆，并将其整合至提示中以增强个性化推荐效果。实验采用序列评分预测任务，在两种场景下评估MAP：单领域（记忆与任务同属一类，如电影）和跨领域（如记忆来自电影而推荐任务针对书籍）。结果表明，MAP优于通过提示设计直接整合用户历史的常规LLM推荐系统。此外，随着用户历史数据增长，MAP在两种场景下的优势均持续扩大，使其更适合处理连续个性化用户请求。
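
The retrieval step described in this abstract (extract relevant memory by similarity, then place it in the prompt) can be sketched as below. The abstract does not specify the embedding model, the similarity measure, or the prompt template, so cosine similarity over precomputed vectors, the `vec`/`item`/`rating` field names, and the prompt wording are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(memory, query_vec, k=2):
    """Return the k history entries most similar to the query, best first.

    memory : list of dicts with hypothetical keys "item", "rating", "vec"
    """
    ranked = sorted(memory, key=lambda m: cosine(m["vec"], query_vec), reverse=True)
    return ranked[:k]


def build_prompt(memory, query_vec, question, k=2):
    """Insert only the retrieved entries into the prompt (assumed template)."""
    lines = [f"- {m['item']}: rated {m['rating']}/5" for m in retrieve(memory, query_vec, k)]
    return "Relevant user history:\n" + "\n".join(lines) + f"\n\nTask: {question}"
```

Because only the top-k entries reach the prompt, the prompt length stays bounded as the user history grows, which is consistent with the abstract's observation that the advantage increases with longer histories.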

GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype

Abstract

arXiv:2505.03853v1 Announce Type: cross Abstract: Predicting genetic perturbations enables the identification of potentially crucial genes prior to wet-lab experiments, significantly improving overall experimental efficiency. Since genes are the foundation of cellular life, building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. However, current methods fail to fully leverage gene-related information, and solely rely on simple evaluation metrics to construct coarse-grained GRN. More importantly, they ignore functional differences between biotypes, limiting the ability to capture potential gene interactions. In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data, respectively, which serve as the initialization for gene representations. Additionally, we introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes, while capturing implicit gene relationships through graph structure learning (GSL). We propose GRAPE, a heterogeneous graph neural network (HGNN) that leverages gene representations initialized with features from descriptions and sequences, models the distinct roles of genes with different biotypes, and dynamically refines the GRN through GSL. The results on publicly available datasets show that our method achieves state-of-the-art performance.

摘要

预测遗传扰动能够在湿实验前识别潜在关键基因，从而显著提升整体实验效率。作为细胞生命的基础，构建基因调控网络（GRN）对于理解和预测遗传扰动效应至关重要。然而现有方法未能充分利用基因相关信息，仅依赖简单评估指标构建粗粒度GRN，更重要的是忽视了生物型之间的功能差异，限制了捕捉潜在基因相互作用的能力。本研究利用预训练大语言模型和DNA序列模型，分别从基因描述文本和DNA序列数据中提取特征作为基因表征的初始化。创新性地首次在遗传扰动研究中引入基因生物型信息，通过模拟不同生物型基因在调控细胞过程中的差异化作用，同时借助图结构学习（GSL）捕获隐含的基因关系。我们提出GRAPE这一异质图神经网络（HGNN），该网络融合描述与序列特征初始化的基因表征，建模不同生物型基因的独特作用，并通过GSL动态优化GRN。公开数据集上的实验结果表明，本方法达到了最先进的性能水平。

Advancing and Benchmarking Personalized Tool Invocation for LLMs

Abstract

arXiv:2505.04072v1 Announce Type: cross Abstract: Tool invocation is a crucial mechanism for extending the capabilities of Large Language Models (LLMs) and has recently garnered significant attention. It enables LLMs to solve complex problems through tool calls while accessing up-to-date world knowledge. However, existing work primarily focuses on the fundamental ability of LLMs to invoke tools for problem-solving, without considering personalized constraints in tool invocation. In this work, we introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query. Tool Preference addresses user preferences when selecting among functionally similar tools, while Profile-dependent Query considers cases where a user query lacks certain tool parameters, requiring the model to infer them from the user profile. To tackle these challenges, we propose PTool, a data synthesis framework designed for personalized tool invocation. Additionally, we construct PTBench, the first benchmark for evaluating personalized tool invocation. We then fine-tune various open-source models, demonstrating the effectiveness of our framework and providing valuable insights. Our benchmark is public at https://github.com/hyfshadow/PTBench.

An Empirical Study of OpenAI API Discussions on Stack Overflow

Abstract

arXiv:2505.04084v1 Announce Type: cross Abstract: The rapid advancement of large language models (LLMs), represented by OpenAI's GPT series, has significantly impacted various domains such as natural language processing, software development, education, healthcare, finance, and scientific research. However, OpenAI APIs introduce unique challenges that differ from traditional APIs, such as the complexities of prompt engineering, token-based cost management, non-deterministic outputs, and operation as black boxes. To the best of our knowledge, the challenges developers encounter when using OpenAI APIs have not been explored in previous empirical studies. To fill this gap, we conduct the first comprehensive empirical study by analyzing 2,874 OpenAI API-related discussions from the popular Q&A forum Stack Overflow. We first examine the popularity and difficulty of these posts. After manually categorizing them into nine OpenAI API-related categories, we identify specific challenges associated with each category through topic modeling analysis. Based on our empirical findings, we finally propose actionable implications for developers, LLM vendors, and researchers.

摘要

以OpenAI的GPT系列为代表的大型语言模型（LLM）快速发展，已显著影响自然语言处理、软件开发、教育、医疗、金融和科研等多个领域。然而，OpenAI API带来了不同于传统API的独特挑战，例如提示工程的复杂性、基于令牌的成本管理、非确定性输出以及黑箱操作特性。据我们所知，开发者使用OpenAI API时遇到的挑战尚未在现有实证研究中得到探讨。为填补这一空白，我们通过分析知名问答论坛Stack Overflow上2,874条OpenAI API相关讨论，开展了首次全面实证研究。首先评估了这些帖子的热度与难度，在人工将其归类为九个OpenAI API相关主题后，通过主题建模分析识别出每个类别对应的具体挑战。基于实证发现，我们最终为开发者、LLM供应商及研究者提出了可操作的改进建议。

LLMs' Suitability for Network Security: A Case Study of STRIDE Threat Modeling

Abstract

arXiv:2505.04101v1 Announce Type: cross Abstract: Artificial Intelligence (AI) is expected to be an integral part of next-generation AI-native 6G networks. With the prevalence of AI, researchers have identified numerous use cases of AI in network security. However, almost no studies analyze the suitability of Large Language Models (LLMs) in network security. To fill this gap, we examine the suitability of LLMs in network security, particularly with the case study of STRIDE threat modeling. We utilize four prompting techniques with five LLMs to perform STRIDE classification of 5G threats. From our evaluation results, we point out key findings and detailed insights along with the explanation of the possible underlying factors influencing the behavior of LLMs in the modeling of certain threats. The numerical results and the insights support the necessity for adjusting and fine-tuning LLMs for network security use cases.

摘要

人工智能（AI）预计将成为下一代AI原生6G网络的核心组成部分。随着AI的普及，研究人员已识别出AI在网络安全中的众多应用场景。然而，目前几乎未有研究分析大语言模型（LLMs）在网络安全领域的适用性。为填补这一空白，我们探讨了LLMs在网络安全中的适用性，特别是通过STRIDE威胁建模的案例研究。我们采用四种提示技术结合五种LLMs对5G威胁进行STRIDE分类。根据评估结果，我们指出关键发现与详细见解，并解释可能影响LLMs在特定威胁建模中行为的潜在因素。数值结果与相关分析表明，有必要针对网络安全用例对LLMs进行调整与微调。

SLOT: Structuring the Output of Large Language Models

Abstract

arXiv:2505.04016v1 Announce Type: cross Abstract: Structured outputs are essential for large language models (LLMs) in critical applications like agents and information extraction. Despite their capabilities, LLMs often generate outputs that deviate from predefined schemas, significantly hampering reliable application development. We present SLOT (Structured LLM Output Transformer), a model-agnostic approach that transforms unstructured LLM outputs into precise structured formats. While existing solutions predominantly rely on constrained decoding techniques or are tightly coupled with specific models, SLOT employs a fine-tuned lightweight language model as a post-processing layer, achieving flexibility across various LLMs and schema specifications. We introduce a systematic pipeline for data curation and synthesis alongside a formal evaluation methodology that quantifies both schema accuracy and content fidelity. Our results demonstrate that fine-tuned Mistral-7B model with constrained decoding achieves near perfect schema accuracy (99.5%) and content similarity (94.0%), outperforming Claude-3.5-Sonnet by substantial margins (+25 and +20 percentage points, respectively). Notably, even compact models like Llama-3.2-1B can match or exceed the structured output capabilities of much larger proprietary models when equipped with SLOT, enabling reliable structured generation in resource-constrained environments.

摘要

结构化输出对于大型语言模型（LLMs）在智能体和信息抽取等关键应用中的部署至关重要。尽管LLMs具备强大能力，但其生成结果常偏离预定义模式，严重阻碍了可靠应用开发。我们提出SLOT（结构化LLM输出转换器），这是一种与模型无关的方法，可将非结构化LLM输出转换为精确的结构化格式。现有解决方案主要依赖约束解码技术或与特定模型强耦合，而SLOT采用微调的轻量级语言模型作为后处理层，实现了跨不同LLMs和模式规范的灵活性。我们提出包含数据整理与合成的系统化流程，以及量化模式准确性和内容保真度的形式化评估方法。实验表明，采用约束解码的微调Mistral-7B模型实现了近乎完美的模式准确率（99.5%）和内容相似度（94.0%），较Claude-3.5-Sonnet分别显著提升25和20个百分点。值得注意的是，即便是Llama-3.2-1B等紧凑模型，在配备SLOT后也能匹配或超越更大型商业模型的结构化输出能力，从而在资源受限环境中实现可靠的结构化生成。

X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains

Abstract

arXiv:2505.03981v1 Announce Type: cross Abstract: Recent proprietary models (e.g., o3) have begun to demonstrate strong multimodal reasoning capabilities. Yet, most existing open-source research concentrates on training text-only reasoning models, with evaluations limited to mainly mathematical and general-domain tasks. Therefore, it remains unclear how to effectively extend reasoning capabilities beyond text input and general domains. This paper explores a fundamental research question: Is reasoning generalizable across modalities and domains? Our findings support an affirmative answer: General-domain text-based post-training can enable such strong generalizable reasoning. Leveraging this finding, we introduce X-Reasoner, a vision-language model post-trained solely on general-domain text for generalizable reasoning, using a two-stage approach: an initial supervised fine-tuning phase with distilled long chain-of-thoughts, followed by reinforcement learning with verifiable rewards. Experiments show that X-Reasoner successfully transfers reasoning capabilities to both multimodal and out-of-domain settings, outperforming existing state-of-the-art models trained with in-domain and multimodal data across various general and medical benchmarks (Figure 1). Additionally, we find that X-Reasoner's performance in specialized domains can be further enhanced through continued training on domain-specific text-only data. Building upon this, we introduce X-Reasoner-Med, a medical-specialized variant that achieves new state of the art on numerous text-only and multimodal medical benchmarks.

摘要

近期专有模型（如o3）已开始展现出强大的多模态推理能力。然而，现有开源研究大多集中于训练纯文本推理模型，且评估主要局限于数学和通用领域任务。因此，如何有效将推理能力扩展到文本输入和通用领域之外仍不明确。本文探索了一个基础研究问题：推理能力是否具有跨模态和跨领域的泛化性？我们的研究给出了肯定答案：基于通用领域文本的后训练能够实现这种强泛化推理能力。基于这一发现，我们提出了X-Reasoner——一个仅通过通用领域文本后训练即可实现泛化推理的视觉语言模型，其采用两阶段方法：首阶段通过蒸馏长思维链进行监督微调，次阶段采用可验证奖励的强化学习。实验表明，X-Reasoner成功将推理能力迁移至多模态和领域外场景，在各类通用及医疗基准测试中（图1），其表现优于现有采用领域内和多模态数据训练的最先进模型。此外，我们发现通过持续训练领域专用纯文本数据，可进一步提升X-Reasoner在专业领域的性能。基于此，我们进一步提出医疗专用变体X-Reasoner-Med，该模型在多项纯文本和多模态医疗基准测试中创造了最新最优性能。

LLM-e Guess: Can LLMs Capabilities Advance Without Hardware Progress?

Abstract

arXiv:2505.04075v1 Announce Type: cross Abstract: This paper examines whether large language model (LLM) capabilities can continue to advance without additional compute by analyzing the development and role of algorithms used in state-of-the-art LLMs. Motivated by regulatory efforts that have largely focused on restricting access to high-performance hardware, we ask: Can LLMs progress in a compute-constrained environment, and how do algorithmic innovations perform under such conditions? To address these questions, we introduce a novel classification framework that distinguishes between compute-dependent innovations, which yield disproportionate benefits at high compute levels (e.g., the Transformer architecture and mixture-of-experts models), and compute-independent innovations, which improve efficiency across all compute scales (e.g., rotary positional encoding, FlashAttention, or layer normalization). We quantify these contributions using a metric called compute-equivalent gain (CEG), which estimates the additional compute that would be required to achieve similar improvements without these algorithmic advancements. To validate this framework, we conduct small-scale training experiments with a scaled-down GPT-2 model. Our results confirm that compute-independent advancements yield meaningful performance gains even in resource-constrained settings, with a CEG of up to $3.5\times$ over a baseline model. By contrast, compute-dependent advancements provided little benefit or even degraded performance at the small scale, reinforcing the importance of compute availability for certain algorithmic gains.

摘要

本文通过分析当前最先进大型语言模型（LLM）所采用算法的发展与作用，探究在无需额外计算资源的情况下LLM能力能否持续提升。鉴于当前监管措施主要集中于限制高性能硬件的获取，我们提出核心问题：LLM能否在计算受限环境中取得进展？算法创新在此类条件下表现如何？

为解决这些问题，我们提出了一种新型分类框架，区分计算依赖型创新（如Transformer架构和专家混合模型——这些创新在高计算量级下产生不成比例的效益）与计算无关型创新（如旋转位置编码、FlashAttention或层归一化——这些创新在所有计算规模下均能提升效率）。我们采用"计算等效增益"（CEG）指标量化这些贡献，该指标估算了在没有算法进步的情况下，实现同等改进所需的额外计算量。

为验证该框架，我们使用缩小版GPT-2模型进行了小规模训练实验。结果表明：计算无关型进步在资源受限环境下仍能产生显著性能提升，其CEG最高可达基线模型的3.5倍；相比之下，计算依赖型进步在小规模场景中收益甚微甚至导致性能下降，这印证了计算资源可获得性对特定算法增益的关键作用。

On-Device LLM for Context-Aware Wi-Fi Roaming

Abstract

arXiv:2505.04174v1 Announce Type: cross Abstract: Wireless roaming is a critical yet challenging task for maintaining seamless connectivity in dynamic mobile environments. Conventional threshold-based or heuristic schemes often fail, leading to either sticky or excessive handovers. We introduce the first cross-layer use of an on-device large language model (LLM): high-level reasoning in the application layer that issues real-time actions executed in the PHY/MAC stack. The LLM addresses two tasks: (i) context-aware AP selection, where structured prompts fuse environmental cues (e.g., location, time) to choose the best BSSID; and (ii) dynamic threshold adjustment, where the model adaptively decides when to roam. To satisfy the tight latency and resource budgets of edge hardware, we apply a suite of optimizations: chain-of-thought prompting, parameter-efficient fine-tuning, and quantization. Experiments on indoor and outdoor datasets show that our approach surpasses legacy heuristics and DRL baselines, achieving a strong balance between roaming stability and signal quality. These findings underscore the promise of application-layer LLM reasoning for lower-layer wireless control in future edge systems.

摘要

无线漫游是动态移动环境中维持无缝连接的关键但具有挑战性的任务。传统基于阈值或启发式方案常因失效导致粘滞切换或过度切换。我们首次提出设备端大语言模型(LLM)的跨层应用：通过应用层高级推理生成PHY/MAC栈实时执行动作。该LLM处理两项任务：(i)上下文感知AP选择，通过结构化提示融合位置、时间等环境线索选择最优BSSID；(ii)动态阈值调整，模型自适应决策漫游时机。为满足边缘硬件严格的延迟与资源限制，我们采用思维链提示、参数高效微调及量化等优化组合。室内外数据集实验表明，本方法超越传统启发式与深度强化学习基线，在漫游稳定性和信号质量间实现良好平衡。这些发现彰显了应用层LLM推理在未来边缘系统中实现底层无线控制的潜力。

Unmasking the Canvas: A Dynamic Benchmark for Image Generation Jailbreaking and LLM Content Safety

Abstract

arXiv:2505.04146v1 Announce Type: cross Abstract: Existing large language models (LLMs) are advancing rapidly and produce outstanding results in image generation tasks, yet their content safety checks remain vulnerable to prompt-based jailbreaks. Through preliminary testing on platforms such as ChatGPT, MetaAI, and Grok, we observed that even short, natural prompts could lead to the generation of compromising images ranging from realistic depictions of forged documents to manipulated images of public figures. We introduce Unmasking the Canvas (UTC Benchmark; UTCB), a dynamic and scalable benchmark dataset to evaluate LLM vulnerability in image generation. Our methodology combines structured prompt engineering, multilingual obfuscation (e.g., Zulu, Gaelic, Base64), and evaluation using Groq-hosted LLaMA-3. The pipeline supports both zero-shot and fallback prompting strategies, risk scoring, and automated tagging. All generations are stored with rich metadata and curated into Bronze (non-verified), Silver (LLM-aided verification), and Gold (manually verified) tiers. UTCB is designed to evolve over time with new data sources, prompt templates, and model behaviors. Warning: This paper includes visual examples of adversarial inputs designed to test model safety. All outputs have been redacted to ensure responsible disclosure.

摘要

现有大型语言模型（LLMs）发展迅速，在图像生成任务中表现优异，但其内容安全检查仍易受基于提示的越狱攻击。通过对ChatGPT、MetaAI和Grok等平台的初步测试，我们发现即使简短的自然提示也可能导致生成不良图像，包括伪造文件的逼真描绘和公众人物图像的篡改。我们提出"解蔽画布"（UTC基准；UTCB），这是一个动态可扩展的基准数据集，用于评估LLMs在图像生成中的脆弱性。该方法结合结构化提示工程、多语言混淆（如祖鲁语、盖尔语、Base64编码）以及基于Groq平台LLaMA-3的评估。该流程支持零样本提示与回退策略、风险评分及自动标记功能。所有生成结果均附带丰富元数据，并分为青铜（未验证）、白银（LLM辅助验证）和黄金（人工验证）三级。UTCB设计为可随时间演进，支持新数据源、提示模板和模型行为的整合。警告：本文包含用于测试模型安全性的对抗性输入视觉示例。所有输出内容均已经过脱敏处理以确保负责任披露。

Facilitating Trustworthy Human-Agent Collaboration in LLM-based Multi-Agent System oriented Software Engineering

Abstract

arXiv:2505.04251v1 Announce Type: cross Abstract: Multi-agent autonomous systems (MAS) are better at addressing challenges that span multiple domains than singular autonomous agents. This holds true within the field of software engineering (SE) as well. The state-of-the-art research on MAS within SE focuses on integrating LLMs at the core of autonomous agents to create LLM-based multi-agent autonomous (LMA) systems. However, the introduction of LMA systems into SE brings a plethora of challenges. One of the major challenges is the strategic allocation of tasks between humans and the LMA system in a trustworthy manner. To address this challenge, this work-in-progress article proposes a RACI-based framework, along with implementation guidelines and an example implementation of the framework. The proposed framework can facilitate efficient collaboration, ensure accountability, and mitigate potential risks associated with LLM-driven automation while aligning with the Trustworthy AI guidelines. The future steps for this work, including the planned empirical validation method, are also presented.

摘要

多智能体自主系统（MAS）在应对跨领域挑战方面优于单一自主智能体，这一优势在软件工程（SE）领域同样成立。当前SE领域关于MAS的前沿研究聚焦于将LLM作为自主智能体的核心组件，以构建基于LLM的多智能体自主（LMA）系统。然而，LMA系统在SE中的引入带来了诸多挑战，其中关键挑战在于如何以可信方式实现人类与LMA系统间的任务战略分配。针对这一挑战，本文提出了一种基于RACI的框架（该研究尚处于进行阶段），同时提供了实施指南及框架的示例实现。所提框架能够促进高效协作、确保责任明晰，并在符合可信AI准则的前提下降低LLM驱动自动化带来的潜在风险。本文还阐述了后续工作的实证验证方法规划。

VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning

Abstract

arXiv:2505.04192v1 Announce Type: cross Abstract: We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, single patch images, automatically keyframe-extracted clips, and manually segmented video pathology images, to mimic the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.

摘要

我们提出VideoPath-LLaVA，这是计算病理学领域首个整合三种不同图像场景（单张切片图像、自动关键帧提取的视频片段和人工分割的病理视频图像）的大型多模态模型（LMM），旨在模拟病理学家的自然诊断流程。通过生成详细的组织学描述并最终形成明确的签出诊断，该模型实现了视觉叙事与诊断推理的融合。

我们的方法核心是VideoPath-Instruct数据集，该数据集包含4278个从YouTube教育病理视频中提取的视频及诊断特定思维链指令对。尽管高质量数据对提升诊断推理能力至关重要，但其创建过程耗时且数量有限。为解决这一问题，我们迁移现有单图像指令数据集的知识，先在弱标注的关键帧提取视频片段上进行训练，再对人工分割视频进行微调。VideoPath-LLaVA为病理视频分析设立了新基准，并通过整合视觉与诊断推理，为未来支持临床决策的AI系统奠定了坚实基础。我们的代码、数据及模型已公开于https://github.com/trinhvg/VideoPath-LLaVA。

Weaponizing Language Models for Cybersecurity Offensive Operations: Automating Vulnerability Assessment Report Validation; A Review Paper

Abstract

arXiv:2505.04265v1 Announce Type: cross Abstract: The ever-increasing sophistication of cyberwar calls for novel solutions. In this regard, Large Language Models (LLMs) have emerged as a highly promising tool for defensive and offensive cybersecurity strategies. While existing literature has focused largely on the defensive use of LLMs, very little has been reported on their offensive use, particularly concerning Vulnerability Assessment (VA) report validation. Consequently, this paper tries to fill that gap by investigating the capabilities of LLMs in automating and improving the validation of VA reports. Based on a critical review of the related literature, this paper proposes a new approach to using LLMs to automate the analysis and validation of VA reports, which could potentially reduce the number of false positives and generally enhance efficiency. These results are promising for using LLM automation to validate VA reports, improving accuracy and security posture while reducing human effort. This paper provides further evidence about offensive and defensive LLM capabilities and therefore helps in devising more appropriate cybersecurity strategies and tools.

摘要

随着网络战复杂度的不断提升，亟需创新解决方案。在此背景下，大语言模型（LLMs）已成为网络安全攻防策略中极具前景的工具。现有研究多聚焦于LLMs的防御性应用，而对其攻击性用途——尤其是漏洞评估（VA）报告验证方面——的探讨则鲜有报道。为此，本研究通过探究LLMs在自动化及改进VA报告验证流程中的能力来填补这一空白。基于对相关文献的批判性综述，本文提出了一种利用LLMs实现VA报告自动化分析与验证的新方法，该方法有望减少误报率并提升整体效率。实验结果表明，LLMs自动化验证VA报告可有效提高准确性，同时降低人工成本并优化安全态势。本文的贡献在于进一步论证了LLMs在攻防两方面的能力，从而为制定更精准的网络安全策略和工具提供了理论依据。

Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering

Abstract

arXiv:2505.04260v1 Announce Type: cross Abstract: As large language models (LLMs) improve in their capacity to serve as personal AI assistants, their ability to output uniquely tailored, personalized responses that align with the soft preferences of their users is essential for enhancing user satisfaction and retention. However, untrained lay users have poor prompt specification abilities and often struggle with conveying their latent preferences to AI assistants. To address this, we leverage activation steering to guide LLMs to align with interpretable preference dimensions during inference. In contrast to memory-based personalization methods that require longer user history, steering is extremely lightweight and can be easily controlled by the user via a linear strength factor. We embed steering into three different interactive chatbot interfaces and conduct a within-subjects user study (n=14) to investigate how end users prefer to personalize their conversations. The results demonstrate the effectiveness of preference-based steering for aligning real-world conversations with hidden user preferences, and highlight further insights on how diverse values around control, usability, and transparency lead users to prefer different interfaces.

摘要

随着大型语言模型（LLMs）作为个人AI助手的能力不断提升，其输出高度定制化、符合用户隐性偏好的个性化回应能力，对于提升用户满意度和留存率至关重要。然而，未经训练的普通用户提示指定能力较差，往往难以向AI助手有效传达潜在偏好。为此，我们利用激活导向技术，在推理过程中引导LLM与可解释的偏好维度对齐。相较于需要较长用户历史记录的基于记忆的个性化方法，导向机制极为轻量级，用户可通过线性强度因子轻松控制。我们将导向机制嵌入三种不同的交互式聊天机器人界面，并开展了一项受试者内用户研究（n=14），以探索终端用户偏好的对话个性化方式。研究结果证实了基于偏好的导向机制在使真实对话与用户隐性偏好对齐方面的有效性，同时揭示了关于控制性、可用性和透明度等多元价值如何导致用户偏好不同界面的深层洞见。
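
The linear strength factor mentioned in the abstract can be illustrated with a minimal sketch. It assumes steering amounts to adding a scaled preference direction to a hidden activation; the function name `steer`, the toy vectors, and the strength values are hypothetical and not taken from the paper.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], strength: float) -> List[float]:
    """Return hidden + strength * direction (an assumed, simplified steering rule)."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activation and a made-up preference direction.
hidden_state = [0.5, -1.0, 2.0]
preference_direction = [1.0, 0.0, -1.0]

# A strength of 0 leaves the activation unchanged; larger values push it
# further along the preference direction.
unchanged = steer(hidden_state, preference_direction, 0.0)
steered = steer(hidden_state, preference_direction, 2.0)
```

In this simplified picture, a single scalar exposed to the user (for example through a slider) controls how strongly the response is pushed toward the preference dimension.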

To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay

Abstract

arXiv:2505.04209v1 Announce Type: cross Abstract: E-commerce sellers are recommended keyphrases based on their inventory on which they advertise to increase buyer engagement (clicks/sales). The relevance of advertiser keyphrases plays an important role in preventing the inundation of search systems with numerous irrelevant items that compete for attention in auctions, in addition to maintaining a healthy seller perception. In this work, we describe the shortcomings of training Advertiser keyphrase relevance filter models on click/sales/search relevance signals and the importance of aligning with human judgment, as sellers have the power to adopt or reject said keyphrase recommendations. In this study, we frame Advertiser keyphrase relevance as a complex interaction between 3 dynamical systems -- seller judgment, which influences seller adoption of our product, Advertising, which provides the keyphrases to bid on, and Search, which holds the auctions for the same keyphrases. This study discusses the practicalities of using human judgment via a case study at eBay Advertising and demonstrates that using LLM-as-a-judge en-masse as a scalable proxy for seller judgment to train our relevance models achieves a better harmony across the three systems -- provided that they are bound by a meticulous evaluation framework grounded in business metrics.

摘要

电子商务卖家会基于其库存获得关键词推荐，并通过这些关键词进行广告投放以提升买家参与度（点击量/销售额）。广告主关键词的相关性不仅对维持健康的卖家形象至关重要，还能有效防止搜索系统因大量不相关商品在竞价中争夺注意力而陷入过载。本研究揭示了基于点击/销售/搜索相关性信号训练广告主关键词相关性过滤模型的局限性，并强调了与人工判断保持一致的重要性——因为卖家有权采纳或拒绝此类关键词推荐。我们将广告主关键词相关性框架化为三个动态系统间的复杂交互：影响卖家产品采纳的卖家判断系统、提供竞价关键词的广告系统，以及主持相同关键词竞拍的搜索系统。通过eBay广告的实际案例，本研究探讨了利用人工判断的可行性，并证明以LLM（大语言模型）作为规模化代理来模拟卖家判断训练相关性模型，能在三个系统间实现更优的协同——前提是这些系统受限于基于商业指标的严谨评估框架。

A Large Language Model for Feasible and Diverse Population Synthesis

Abstract

arXiv:2505.04196v1 Announce Type: cross Abstract: Generating a synthetic population that is both feasible and diverse is crucial for ensuring the validity of downstream activity schedule simulation in activity-based models (ABMs). While deep generative models (DGMs), such as variational autoencoders and generative adversarial networks, have been applied to this task, they often struggle to balance the inclusion of rare but plausible combinations (i.e., sampling zeros) with the exclusion of implausible ones (i.e., structural zeros). To improve feasibility while maintaining diversity, we propose a fine-tuning method for large language models (LLMs) that explicitly controls the autoregressive generation process through topological orderings derived from a Bayesian Network (BN). Experimental results show that our hybrid LLM-BN approach outperforms both traditional DGMs and proprietary LLMs (e.g., ChatGPT-4o) with few-shot learning. Specifically, our approach achieves approximately 95% feasibility, significantly higher than the ~80% observed in DGMs, while maintaining comparable diversity, making it well-suited for practical applications. Importantly, the method is based on a lightweight open-source LLM, enabling fine-tuning and inference on standard personal computing environments. This makes the approach cost-effective and scalable for large-scale applications, such as synthesizing populations in megacities, without relying on expensive infrastructure. By initiating the ABM pipeline with high-quality synthetic populations, our method improves overall simulation reliability and reduces downstream error propagation. The source code for these methods is available for research and practical application.

摘要

生成既可行又多样化的合成人口对于确保基于活动的模型（ABM）中下游活动日程模拟的有效性至关重要。虽然变分自编码器和生成对抗网络等深度生成模型（DGM）已应用于此任务，但它们往往难以平衡包含罕见但合理的组合（即抽样零值）与排除不合理组合（即结构零值）之间的关系。为提高可行性同时保持多样性，我们提出了一种针对大语言模型（LLM）的微调方法，该方法通过从贝叶斯网络（BN）导出的拓扑顺序显式控制自回归生成过程。实验结果表明，我们的混合LLM-BN方法在少量样本学习情况下，性能优于传统DGM和专有LLM（如ChatGPT-4o）。具体而言，我们的方法实现了约95%的可行性，显著高于DGM中观察到的约80%，同时保持了相当的多样性，使其非常适合实际应用。重要的是，该方法基于轻量级开源LLM，可在标准个人计算环境中进行微调和推理，这使得该方法在大规模应用（如特大城市人口合成）中具有成本效益和可扩展性，而无需依赖昂贵的基础设施。通过以高质量合成人口启动ABM流程，我们的方法提高了整体模拟可靠性并减少了下游误差传播。这些方法的源代码可供研究和实际应用使用。
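
The idea of ordering autoregressive generation by a Bayesian-network topological ordering can be sketched as follows. This is only an illustration under stated assumptions: the attribute names, the parent structure in `bn_parents`, and the text template in `serialize` are hypothetical, and the paper's actual serialization and fine-tuning procedure are not described in the abstract.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

# Hypothetical Bayesian-network structure: attribute -> its parent attributes.
bn_parents: Dict[str, List[str]] = {
    "age": [],
    "employment": ["age"],
    "income": ["age", "employment"],
    "car_ownership": ["income"],
}


def attribute_order(parents: Dict[str, List[str]]) -> List[str]:
    """Return an ordering in which every attribute comes after its parents."""
    return list(TopologicalSorter(parents).static_order())


def serialize(person: Dict[str, str], order: List[str]) -> str:
    """Render one record as text with attributes in BN topological order."""
    return ", ".join(f"{name} is {person[name]}" for name in order)


order = attribute_order(bn_parents)
record = {
    "income": "medium",
    "age": "35-44",
    "car_ownership": "yes",
    "employment": "full-time",
}
text = serialize(record, order)
```

With records serialized this way, a left-to-right language model always sees an attribute's parents before it generates the attribute itself, which is one plausible reading of how the ordering constrains generation.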

OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models

Abstract

arXiv:2505.04416v1 Announce Type: cross Abstract: Large language models (LLMs) trained over extensive corpora risk memorizing sensitive, copyrighted, or toxic content. To address this, we propose OBLIVIATE, a robust unlearning framework that removes targeted data while preserving model utility. The framework follows a structured process: extracting target tokens, building retain sets, and fine-tuning with a tailored loss function comprising three components -- masking, distillation, and world fact. Using low-rank adapters (LoRA), it ensures efficiency without compromising unlearning quality. We conduct experiments on multiple datasets, including the Harry Potter series, WMDP, and TOFU, using a comprehensive suite of metrics: forget quality (new document-level memorization score), model utility, and fluency. Results demonstrate its effectiveness in resisting membership inference attacks, minimizing the impact on retained data, and maintaining robustness across diverse scenarios.

摘要

在大规模语料库上训练的大型语言模型（LLMs）存在记忆敏感信息、受版权保护内容或有害内容的风险。为解决这一问题，我们提出OBLIVIATE——一个鲁棒的遗忘框架，可在保留模型效用的同时移除目标数据。该框架遵循结构化流程：提取目标标记、构建保留集，以及采用包含三个组件的定制损失函数进行微调——掩码、蒸馏和世界事实。通过使用低秩适配器（LoRA），该框架在保证遗忘质量的同时确保了效率。我们在多个数据集（包括《哈利·波特》系列、WMDP和TOFU）上进行了实验，采用了一套综合评估指标：遗忘质量（新文档级记忆分数）、模型效用和流畅性。结果表明，该方法能有效抵抗成员推理攻击，最大限度降低对保留数据的影响，并在多样场景中保持鲁棒性。

YABLoCo: Yet Another Benchmark for Long Context Code Generation

Abstract

arXiv:2505.04406v1 Announce Type: cross Abstract: Large Language Models demonstrate the ability to solve various programming tasks, including code generation. Typically, the performance of LLMs is measured on benchmarks with small or medium-sized context windows of thousands of lines of code. At the same time, in real-world software projects, repositories can span up to millions of LoC. This paper closes this gap by contributing a long-context code generation benchmark, YABLoCo. The benchmark features a test set of 215 functions selected from four large repositories with thousands of functions. The dataset contains metadata of functions, contexts of the functions with different levels of dependencies, docstrings, function bodies, and call graphs for each repository. This paper presents three key aspects of the contribution. First, the benchmark aims at function body generation in large repositories in C and C++, two languages not covered by previous benchmarks. Second, the benchmark contains large repositories from 200K to 2,000K LoC. Third, we contribute a scalable evaluation pipeline for efficient computing of the target metrics and a tool for visual analysis of generated code. Overall, these three aspects allow for evaluating code generation in large repositories in C and C++.

摘要

大型语言模型展现出解决各类编程任务的能力，包括代码生成。通常，LLM的性能是在包含数千行代码的中小型上下文窗口基准测试中进行评估的。然而在实际软件项目中，代码库可能达到数百万行代码量级。本文通过贡献长上下文代码生成基准(YABLoCo)来填补这一研究空白。该基准测试集包含从四个大型代码库中精选的215个函数，这些代码库均具有数千个函数规模。数据集包含函数元数据、具有不同依赖级别的函数上下文、文档字符串、函数体以及每个代码库的调用关系图。本文的贡献主要体现在三个关键方面：首先，该基准专注于C和C++这两种未被现有基准覆盖的语言在大型代码库中的函数体生成；其次，基准包含20万至200万行代码量级的大型代码库；第三，我们贡献了一个可扩展的评估流水线用于高效计算目标指标，以及一个生成代码可视化分析工具。这三个方面共同实现了对C/C++大型代码库代码生成能力的全面评估。

The Aloe Family Recipe for Open and Specialized Healthcare LLMs

Abstract

arXiv:2505.04388v1 Announce Type: cross Abstract: Purpose: With advancements in Large Language Models (LLMs) for healthcare, the need arises for competitive open-source models to protect the public interest. This work contributes to the field of open medical LLMs by optimizing key stages of data preprocessing and training, while showing how to improve model safety (through DPO) and efficacy (through RAG). The evaluation methodology used, which includes four different types of tests, defines a new standard for the field. The resultant models, shown to be competitive with the best private alternatives, are released with a permissive license. Methods: Building on top of strong base models like Llama 3.1 and Qwen 2.5, Aloe Beta uses a custom dataset to enhance public data with synthetic Chain of Thought examples. The models undergo alignment with Direct Preference Optimization, emphasizing ethical and policy-aligned performance in the presence of jailbreaking attacks. Evaluation includes close-ended, open-ended, safety and human assessments, to maximize the reliability of results. Results: Recommendations are made across the entire pipeline, backed by the solid performance of the Aloe Family. These models deliver competitive performance across healthcare benchmarks and medical fields, and are often preferred by healthcare professionals. On bias and toxicity, the Aloe Beta models significantly improve safety, showing resilience to unseen jailbreaking attacks. For a responsible release, a detailed risk assessment specific to healthcare is attached to the Aloe Family models. Conclusion: The Aloe Beta models, and the recipe that leads to them, are a significant contribution to the open-source medical LLM field, offering top-of-the-line performance while maintaining high ethical requirements. This work sets a new standard for developing and reporting aligned LLMs in healthcare.

摘要

目的：随着大型语言模型(LLM)在医疗领域的进步，亟需具有竞争力的开源模型以保障公共利益。本研究通过优化数据预处理和训练的关键阶段，同时展示如何提升模型安全性(通过DPO)和效能(通过RAG)，为开源医疗LLM领域作出贡献。采用的评估方法包含四种不同类型的测试，为该领域确立了新标准。最终发布的模型性能可与最佳私有替代方案竞争，并采用宽松许可协议。

方法：基于Llama 3.1和Qwen 2.5等强大基础模型，Aloe Beta使用定制数据集增强公共数据，添加合成思维链示例。模型通过直接偏好优化进行对齐，着重提升在越狱攻击情况下的伦理和政策合规表现。评估包含封闭式、开放式、安全性和人工测试，以最大化结果可靠性。

结果：基于Aloe系列模型的稳健表现，我们提出全流程优化建议。这些模型在医疗基准测试和各医学领域均展现竞争优势，并常获医疗专业人员青睐。在偏见和毒性方面，Aloe Beta模型显著提升安全性，对未见过的越狱攻击表现出强韧性。为负责任地发布，Aloe系列模型附有针对医疗领域的详细风险评估。

结论：Aloe Beta模型及其构建方法是对开源医疗LLM领域的重要贡献，在满足高标准伦理要求的同时提供顶尖性能。本研究为医疗领域对齐LLM的开发和报告设立了新标准。

"I Can See Forever!": Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

Abstract

arXiv:2505.04488v1 Announce Type: cross Abstract: The visually impaired population, especially the severely visually impaired, is currently large in scale, and daily activities pose significant challenges for them. Although many studies use large language and vision-language models to assist the blind, most focus on static content and fail to meet real-time perception needs in dynamic and complex environments, such as daily activities. To provide them with more effective intelligent assistance, it is imperative to incorporate advanced visual understanding technologies. Although real-time vision and speech interaction VideoLLMs demonstrate strong real-time visual understanding, no prior work has systematically evaluated their effectiveness in assisting visually impaired individuals. In this work, we conduct the first such evaluation. First, we construct a benchmark dataset (VisAssistDaily), covering three categories of assistive tasks for visually impaired individuals: Basic Skills, Home Life Tasks, and Social Life Tasks. The results show that GPT-4o achieves the highest task success rate. Next, we conduct a user study to evaluate the models in both closed-world and open-world scenarios, further exploring the practical challenges of applying VideoLLMs in assistive contexts. One key issue we identify is the difficulty current models face in perceiving potential hazards in dynamic environments. To address this, we build an environment-awareness dataset named SafeVid and introduce a polling mechanism that enables the model to proactively detect environmental risks. We hope this work provides valuable insights and inspiration for future research in this field.

摘要

视觉障碍人群，尤其是重度视障者，当前规模庞大，日常活动对他们构成重大挑战。尽管许多研究利用大语言模型和视觉语言模型辅助盲人，但多数聚焦静态内容，难以满足动态复杂环境（如日常活动）中的实时感知需求。为提供更有效的智能辅助，必须整合先进的视觉理解技术。虽然实时视觉与语音交互的VideoLLMs展现出强大的实时视觉理解能力，但此前尚无研究系统评估其在辅助视障者方面的有效性。本研究首次开展此类评估：首先构建涵盖视障辅助任务三大类别（基础技能、家庭生活任务、社会生活任务）的基准数据集VisAssistDaily。结果显示GPT-4o任务达成率最高；继而通过用户研究评估模型在封闭与开放场景中的表现，进一步探索VideoLLMs在辅助场景中的应用挑战。我们发现关键问题在于现有模型难以感知动态环境中的潜在危险，为此构建环境感知数据集SafeVid并引入轮询机制，使模型能主动检测环境风险。本研究希望为该领域未来工作提供有益启示。

Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Abstract

arXiv:2505.04531v1 Announce Type: cross Abstract: Generative language modelling has surged in popularity with the emergence of services such as ChatGPT and Google Gemini. While these models have demonstrated transformative potential in productivity and communication, they overwhelmingly cater to high-resource languages like English. This has amplified concerns over linguistic inequality in natural language processing (NLP). This paper presents the first systematic review focused specifically on strategies to address data scarcity in generative language modelling for low-resource languages (LRL). Drawing from 54 studies, we identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering, across generative tasks. We also analyse trends in architecture choices, language family representation, and evaluation methods. Our findings highlight a strong reliance on transformer-based models, a concentration on a small subset of LRLs, and a lack of consistent evaluation across studies. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems. Ultimately, this review aims to support researchers and developers in building inclusive AI tools for underrepresented languages, a necessary step toward empowering LRL speakers and the preservation of linguistic diversity in a world increasingly shaped by large-scale language technologies.

摘要

随着ChatGPT和Google Gemini等服务的出现，生成式语言建模的普及度急剧上升。尽管这些模型在提升生产力和沟通方面展现出变革性潜力，但其服务对象绝大多数是英语等高资源语言。这种现象加剧了人们对自然语言处理（NLP）领域语言不平等问题的担忧。本文首次针对低资源语言（LRL）生成式建模中的数据稀缺问题解决方案进行了系统性综述。基于54项研究，我们对包括单语数据增强、回译、多语言训练和提示工程等技术方法在生成任务中的应用进行了识别、分类和评估，同时分析了模型架构选择、语系分布和评估方法的发展趋势。研究发现：当前研究过度依赖基于Transformer的模型、集中于少数低资源语言、且缺乏统一的评估标准。最后，我们提出了将这些方法扩展到更广泛低资源语言的建议，并阐述了构建公平的生成式语言系统所面临的开放挑战。本综述旨在帮助研究者和开发者构建面向弱势语言的包容性人工智能工具，这是在语言技术日益影响世界的背景下，赋能低资源语言使用者并保护语言多样性的必要步骤。

Fight Fire with Fire: Defending Against Malicious RL Fine-Tuning via Reward Neutralization

Abstract

arXiv:2505.04578v1 Announce Type: cross Abstract: Reinforcement learning (RL) fine-tuning transforms large language models while creating a vulnerability that we verify experimentally: malicious RL fine-tuning dismantles safety guardrails with remarkable efficiency, requiring only 50 steps and minimal adversarial prompts, with harmful scores escalating from 0-2 to 7-9. This attack vector particularly threatens open-source models with parameter-level access. Existing defenses targeting supervised fine-tuning prove ineffective against RL's dynamic feedback mechanisms. We introduce Reward Neutralization, the first defense framework specifically designed against RL fine-tuning attacks, establishing concise rejection patterns that render malicious reward signals ineffective. Our approach trains models to produce minimal-information rejections that attackers cannot exploit, systematically neutralizing attempts to optimize toward harmful outputs. Experiments validate that our approach maintains low harmful scores (no greater than 2) after 200 attack steps, while standard models rapidly deteriorate. This work provides the first constructive proof that robust defense against increasingly accessible RL attacks is achievable, addressing a critical security gap for open-weight models.

摘要

强化学习（RL）微调在优化大语言模型的同时，也引发了我们通过实验验证的安全漏洞：研究表明，恶意RL微调能以惊人效率破坏安全防护机制，仅需50步训练和少量对抗性提示即可使危害等级从0-2骤升至7-9。这种攻击方式尤其威胁具有参数级访问权限的开源模型。现有针对监督微调的防御措施对RL的动态反馈机制完全无效。我们提出"奖励中和"技术——首个专门防御RL微调攻击的框架，通过建立简洁的拒绝模式使恶意奖励信号失效。该方法训练模型生成攻击者无法利用的最小信息拒绝响应，系统性地中和优化有害输出的尝试。实验证实我们的方案在200次攻击步数后仍能保持低危害分数（不超过2），而标准模型则快速恶化。本研究首次以建设性证明表明，针对日益普及的RL攻击实现稳健防御是可行的，为开源权重模型填补了关键安全空白。

Context-aware LLM-based Safe Control Against Latent Risks

Abstract

arXiv:2403.11863v2 Announce Type: replace Abstract: Autonomous control systems face significant challenges in performing complex tasks in the presence of latent risks. To address this, we propose an integrated framework that combines Large Language Models (LLMs), numerical optimization, and optimization-based control to facilitate efficient subtask learning while ensuring safety against latent risks. The framework decomposes complex tasks into a sequence of context-aware subtasks that account for latent risks. These subtasks and their parameters are then refined through a multi-time-scale process: high-layer multi-turn in-context learning, mid-layer LLM Chain-of-Thought reasoning and numerical optimization, and low-layer model predictive control. The framework iteratively improves decisions by leveraging qualitative feedback and optimized trajectory data from lower-layer optimization processes and a physics simulator. We validate the proposed framework through simulated case studies involving robot arm and autonomous vehicle scenarios. The experiments demonstrate that the proposed framework can mediate actions based on the context and latent risks and learn complex behaviors efficiently.

摘要

自主控制系统在存在潜在风险的情况下执行复杂任务面临重大挑战。为此，我们提出一个集成框架，该框架结合了大型语言模型（LLMs）、数值优化和基于优化的控制方法，以促进高效子任务学习，同时确保对潜在风险的安全性。该框架将复杂任务分解为一系列考虑潜在风险的上下文感知子任务。这些子任务及其参数通过多时间尺度过程进行优化：高层采用多轮上下文学习，中层运用LLM思维链推理与数值优化，底层实施模型预测控制。通过整合来自下层优化过程和物理模拟器的定性反馈与优化轨迹数据，该框架能迭代改进决策。我们在机器人手臂和自动驾驶车辆的仿真案例研究中验证了所提框架。实验表明，该框架能够根据上下文和潜在风险调节动作，并高效学习复杂行为。

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

Abstract

arXiv:2505.04623v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet they often struggle with structured cross-modal reasoning, particularly when integrating audio and visual signals. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs. Built upon the Qwen2.5-Omni-7B foundation and optimized with Group Relative Policy Optimization (GRPO), EchoInk-R1 tackles multiple-choice question answering over synchronized audio-image pairs. To enable this, we curate AVQA-R1-6K, a dataset pairing such audio-image inputs with multiple-choice questions derived from OmniInstruct-v1. EchoInk-R1-7B achieves 85.77% accuracy on the validation set, outperforming the base model, which scores 80.53%, using only 562 reinforcement learning steps. Beyond accuracy, EchoInk-R1 demonstrates reflective reasoning by revisiting initial interpretations and refining responses when facing ambiguous multimodal inputs. These results suggest that lightweight reinforcement learning fine-tuning enhances cross-modal reasoning in MLLMs. EchoInk-R1 is the first framework to unify audio, visual, and textual modalities for general open-world reasoning via reinforcement learning. Code and data are publicly released to facilitate further research.

摘要

多模态大语言模型（MLLMs）在文本、视觉和音频感知方面取得了进展，但在结构化跨模态推理（尤其是整合音频与视觉信号时）仍面临挑战。我们提出EchoInk-R1——一个基于强化学习的框架，用于增强MLLMs的此类推理能力。该框架以Qwen2.5-Omni-7B为基础模型，通过群组相对策略优化（GRPO）进行训练优化，专注于同步音频-图像配对的多选题问答任务。为此，我们构建了AVQA-R1-6K数据集，该数据集将音频-图像输入与源自OmniInstruct-v1的多选题配对。EchoInk-R1-7B在验证集上达到85.77%准确率，仅通过562步强化学习即超越基础模型（80.53%）。除准确性外，EchoInk-R1展现出反思推理能力：当面对模糊的多模态输入时，能重新审视初始解读并优化响应。这些结果表明，轻量级强化学习微调可有效提升MLLMs的跨模态推理能力。EchoInk-R1是首个通过强化学习统一音频、视觉与文本模态以实现开放世界通用推理的框架。代码与数据已开源以促进后续研究。
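
As a rough illustration of the group-relative idea behind GRPO, the sketch below normalizes the rewards of several sampled answers to the same question against the group's own mean and standard deviation. The 0/1 correctness reward, the use of the population standard deviation, and the `eps` term are assumptions made for this example; the abstract does not specify EchoInk-R1's reward design.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group.

    Assumption: population standard deviation plus a small eps for stability;
    implementations differ on these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one multiple-choice
# question: 1.0 if the chosen option is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative advantages, and a group in which every answer earns the same reward yields all-zero advantages, so no separate learned value model is needed to form the baseline.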

Towards a HIPAA Compliant Agentic AI System in Healthcare

Abstract

arXiv:2504.17669v2 Announce Type: replace Abstract: Agentic AI systems, powered by Large Language Models (LLMs) as their foundational reasoning engine, are transforming clinical workflows such as medical report generation and clinical summarization by autonomously analyzing sensitive healthcare data and executing decisions with minimal human oversight. However, their adoption demands strict compliance with regulatory frameworks such as the Health Insurance Portability and Accountability Act (HIPAA), particularly when handling Protected Health Information (PHI). This work-in-progress paper introduces a HIPAA-compliant Agentic AI framework that enforces regulatory compliance through dynamic, context-aware policy enforcement. Our framework integrates three core mechanisms: (1) Attribute-Based Access Control (ABAC) for granular PHI governance, (2) a hybrid PHI sanitization pipeline combining regex patterns and a BERT-based model to minimize leakage, and (3) immutable audit trails for compliance verification.

摘要

以大型语言模型（LLMs）为核心推理引擎的自主人工智能系统，正在通过自主分析敏感医疗数据并在最小化人工干预下执行决策，改变医疗报告生成和临床总结等临床工作流程。然而，其应用需严格遵守《健康保险可携性和责任法案》（HIPAA）等监管框架，尤其是在处理受保护健康信息（PHI）时。本文提出一个符合HIPAA标准的自主人工智能框架，通过动态、上下文感知的策略执行来确保监管合规性。该框架整合了三个核心机制：（1）基于属性的访问控制（ABAC）以实现细粒度PHI治理，（2）结合正则表达式模式和基于BERT模型的混合PHI清理流程以最小化信息泄露，（3）不可篡改的审计追踪以进行合规性验证。
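
The three mechanisms listed above can be sketched in miniature. Everything here is an illustrative assumption rather than the paper's implementation: the regex patterns cover only a few obvious PHI formats (the BERT-based stage is omitted), the ABAC rule in `is_allowed` is a toy policy, and the audit trail is approximated with a tamper-evident SHA-256 hash chain.

```python
import hashlib
import json
import re
from typing import Dict, List

# Illustrative patterns only; real PHI detection needs far broader coverage.
PHI_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(text: str) -> str:
    """Replace each regex match with a bracketed label such as [SSN]."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def is_allowed(subject: Dict[str, str], resource: Dict[str, str]) -> bool:
    """Toy ABAC rule: clinicians may read records from their own department."""
    return (
        subject.get("role") == "clinician"
        and subject.get("department") == resource.get("department")
    )


def _digest(prev_hash: str, payload: str) -> str:
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_audit(log: List[Dict[str, str]], event: Dict[str, str]) -> None:
    """Append an event whose hash also covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(event, sort_keys=True)
    log.append({"prev": prev_hash, "event": payload, "hash": _digest(prev_hash, payload)})


def verify_audit(log: List[Dict[str, str]]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = "0" * 64
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


note = "Patient SSN 123-45-6789, call 555-123-4567 or mail jane.doe@example.org."
clean_note = redact(note)

audit_log: List[Dict[str, str]] = []
append_audit(audit_log, {"actor": "agent-1", "action": "read", "record": "r-42"})
append_audit(audit_log, {"actor": "agent-1", "action": "summarize", "record": "r-42"})
```

A hash chain makes tampering detectable rather than impossible, so a production system would still need write-once storage or an external anchor to approach true immutability.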

Fate: Fast Edge Inference of Mixture-of-Experts Models via Cross-Layer Gate

Abstract

arXiv:2502.12224v2 Announce Type: replace Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, and their application in edge scenarios has attracted significant attention. However, sparse-activated Mixture-of-Experts (MoE) models, which are well suited for edge scenarios, have received relatively little attention due to their high memory demands. Offload-based methods have been proposed to address this challenge, but they face difficulties with expert prediction. Inaccurate expert predictions can result in prolonged inference delays. To promote the application of MoE models in edge scenarios, we propose Fate, an offloading system designed for MoE models to enable efficient inference in resource-constrained environments. The key insight behind Fate is that gate inputs from adjacent layers can be effectively used for expert prefetching, achieving high prediction accuracy without additional GPU overhead. Furthermore, Fate employs a shallow-favoring expert caching strategy that increases the expert hit rate to 99%. Additionally, Fate integrates tailored quantization strategies for cache optimization and IO efficiency. Experimental results show that, compared to Load on Demand and Expert Activation Path-based method, Fate achieves up to 4.5x and 1.9x speedups in prefill speed and up to 4.1x and 2.2x speedups in decoding speed, respectively, while maintaining inference quality. Moreover, Fate's performance improvements are scalable across different memory budgets.

摘要

大型语言模型（LLMs）在各种任务中展现出卓越性能，其在边缘计算场景的应用备受关注。然而，特别适合边缘场景的稀疏激活混合专家（MoE）模型，由于高内存需求而相对缺乏研究。现有基于卸载的方法虽试图解决该问题，但面临专家预测困难——预测不准确会导致推理延迟显著增加。为促进MoE模型在边缘场景的应用，我们提出Fate：一个面向MoE模型的卸载系统，可在资源受限环境中实现高效推理。Fate的核心思想是利用相邻层的门控输入实现专家预取，在不增加GPU开销的前提下达成高预测准确率。该系统还采用偏向浅层专家的缓存策略，将专家命中率提升至99%。此外，Fate集成定制化量化策略以优化缓存和IO效率。实验表明：相较于按需加载和基于专家激活路径的方法，Fate在预填充阶段分别实现最高4.5倍和1.9倍加速，在解码阶段分别获得最高4.1倍和2.2倍加速，且保持推理质量不变。值得注意的是，Fate的性能提升在不同内存预算下均具有可扩展性。

Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models

Abstract

arXiv:2308.15022v3 Announce Type: replace-cross Abstract: Recently, large language models (LLMs), such as GPT-4, have shown remarkable conversational abilities, enabling them to engage in dynamic and contextually relevant dialogues across a wide range of topics. However, given a long conversation, these chatbots fail to recall past information and tend to generate inconsistent responses. To address this, we propose to recursively generate summaries/memory using LLMs to enhance long-term memory ability. Specifically, our method first stimulates LLMs to memorize small dialogue contexts and then recursively produces new memory using the previous memory and the following contexts. Finally, the chatbot can easily generate a highly consistent response with the help of the latest memory. We evaluate our method on both open and closed LLMs, and experiments on a widely used public dataset show that our method generates more consistent responses in long-context conversations. We also show that our strategy complements both long-context (e.g., 8K and 16K) and retrieval-enhanced LLMs, further improving long-term dialogue performance. Notably, our method is a potential solution for enabling LLMs to model extremely long contexts. The code and scripts will be released later.

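Summary

In recent years, large language models (LLMs) such as GPT-4 have shown remarkable conversational abilities, holding dynamic and contextually appropriate exchanges across a wide range of topics. In long conversations, however, these chatbots struggle to recall earlier information and tend to produce inconsistent responses. We therefore propose using LLMs to recursively generate summaries/memory to strengthen long-term memory. Specifically, the method first prompts the LLM to memorize short stretches of dialogue context, then recursively produces new memory from the previous memory and the subsequent context. The chatbot can then use the latest memory to generate highly consistent responses. We evaluate the method on open and closed LLMs, and experiments on a widely used public dataset show that it yields more consistent responses in long-context conversations. We also show that the strategy complements long-context (e.g., 8K and 16K) models and retrieval-enhanced LLMs, further improving long-term dialogue performance. Notably, the method offers a potential way for LLMs to model extremely long contexts. The code and scripts will be released later.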
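The recursive memory update described above can be sketched as a simple fold over dialogue chunks. This is a minimal illustration under stated assumptions: the prompt wording, the chunk size, and the `toy_summarizer` stand-in (which merely keeps lines tagged `FACT:`) are hypothetical; in practice `summarize` would be a call to an LLM.

```python
from typing import Callable, List


def update_memory(summarize: Callable[[str], str], memory: str, turns: List[str]) -> str:
    """Produce new memory from the previous memory plus the next dialogue chunk."""
    prompt = f"Previous memory:\n{memory}\n\nNew dialogue:\n" + "\n".join(turns)
    return summarize(prompt)


def build_memory(summarize: Callable[[str], str], dialogue: List[str], chunk_size: int = 4) -> str:
    """Walk the dialogue chunk by chunk, feeding each new memory into the next step."""
    memory = ""
    for start in range(0, len(dialogue), chunk_size):
        memory = update_memory(summarize, memory, dialogue[start:start + chunk_size])
    return memory


def toy_summarizer(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: keep each distinct line tagged as a fact."""
    facts = [line for line in prompt.splitlines() if line.startswith("FACT:")]
    return "\n".join(dict.fromkeys(facts))  # de-duplicate while keeping order


dialogue = [
    "FACT: user likes tea", "hello", "how are you", "fine",
    "FACT: user lives in Oslo", "bye",
]
final_memory = build_memory(toy_summarizer, dialogue)
```

Because each step sees only the previous memory and one chunk, the prompt length stays bounded no matter how long the conversation grows, which is the property the abstract relies on.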

Question-Answering Dense Video Events

Abstract

arXiv:2409.04388v4 Announce Type: replace-cross Abstract: This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA, a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves remarkable increases of 4.8% and 2.1% in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Our data and code will be released upon acceptance.

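Summary

This paper introduces question answering on dense video events, a new task that answers dense-event questions in long videos and grounds the relevant segments, challenging multimodal large language models (MLLMs) to faithfully understand and reason about multiple events over long time spans. To support the study, we build the DeVE-QA dataset, which contains 78K questions about 26K events in 10.6K long videos. Benchmarking shows that current state-of-the-art MLLMs perform poorly on DeVE-QA. We therefore propose DeVi, a training-free approach that detects events with a hierarchical captioning module, contextualizes and memorizes them with a temporal event memory module, and grounds dense events for question answering with a self-consistency checking module. Extensive experiments show that DeVi performs well at dense-event question answering and video moment grounding. Compared with existing MLLMs, it improves G(round)QA accuracy by 4.8% on DeVE-QA and 2.1% on NExT-GQA. Our data and code will be released upon acceptance.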

Estimating LLM Uncertainty with Logits

Abstract

arXiv:2502.00290v4 Announce Type: replace-cross Abstract: Over the past few years, Large Language Models (LLMs) have developed rapidly and are widely applied in various domains. However, LLMs face the issue of hallucinations, generating responses that may be unreliable when the models lack relevant knowledge. To be aware of potential hallucinations, uncertainty estimation methods have been introduced, and most of them have confirmed that reliability lies in critical tokens. However, probability-based methods perform poorly in identifying token reliability, limiting their practical utility. In this paper, we reveal that the probability-based method fails to estimate token reliability due to the loss of evidence strength information which is accumulated in the training stage. Therefore, we present Logits-induced token uncertainty (LogTokU), a framework for estimating decoupled token uncertainty in LLMs, enabling real-time uncertainty estimation without requiring multiple sampling processes. We employ evidence modeling to implement LogTokU and use the estimated uncertainty to guide downstream tasks. The experimental results demonstrate that LogTokU has significant effectiveness and promise.

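Summary

Over the past few years, large language models (LLMs) have developed rapidly and are widely applied across domains. However, LLMs suffer from hallucinations and may generate unreliable responses when they lack relevant knowledge. Uncertainty estimation methods have been introduced to identify potential hallucinations, and most of this work confirms that reliability hinges on critical tokens. Probability-based methods nevertheless perform poorly at identifying token reliability, which limits their practical value. This paper shows that probability-based methods fail because they lose the evidence strength information accumulated during training. On that basis we propose Logits-induced token uncertainty (LogTokU), a framework that estimates decoupled token uncertainty in LLMs in real time without multiple sampling passes. We implement LogTokU with evidence modeling and use the estimated uncertainty to guide downstream tasks. Experimental results indicate that LogTokU is effective and promising.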

CLEAR: Cue Learning using Evolution for Accurate Recognition Applied to Sustainability Data Extraction

Abstract

arXiv:2501.18504v3 Announce Type: replace-cross Abstract: Large Language Model (LLM) image recognition is a powerful tool for extracting data from images, but accuracy depends on providing sufficient cues in the prompt, which requires a domain expert for specialized tasks. We introduce Cue Learning using Evolution for Accurate Recognition (CLEAR), which uses a combination of LLMs and evolutionary computation to generate and optimize cues such that recognition of specialized features in images is improved. It achieves this by auto-generating a novel domain-specific representation and then using it to optimize suitable textual cues with a genetic algorithm. We apply CLEAR to the real-world task of identifying sustainability data from interior and exterior images of buildings. We investigate the effects of using a variable-length representation compared to a fixed-length one and show how LLM consistency can be improved by refactoring from categorical to real-valued estimates. We show that CLEAR enables higher accuracy than expert human recognition and human-authored prompts in every task, with error rates improved by up to two orders of magnitude, and an ablation study evinces solution concision.

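Summary

Large language model (LLM) image recognition is a powerful tool for extracting data from images, but its accuracy depends on supplying sufficient cues in the prompt, which for specialized tasks requires a domain expert. We propose Cue Learning using Evolution for Accurate Recognition (CLEAR), which combines LLMs with evolutionary computation to generate and optimize cues and thereby improve recognition of specialized features in images. Its core mechanism is to automatically generate a novel domain-specific representation and then optimize suitable textual cues with a genetic algorithm. We apply CLEAR to the real-world task of identifying sustainability data from interior and exterior images of buildings, examine variable-length versus fixed-length representations, and show how recasting categorical estimates as real-valued estimates improves LLM consistency. Experiments show that, in every task, CLEAR is more accurate than human experts and human-authored prompts, with error rates reduced by up to two orders of magnitude; an ablation study confirms the concision of the solutions.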

A Simple Ensemble Strategy for LLM Inference: Towards More Stable Text Classification

Abstract

arXiv:2504.18884v2 Announce Type: replace-cross Abstract: With the advance of large language models (LLMs), LLMs have been used for a variety of tasks. However, the variability and reproducibility of results across individual LLM trials have been largely overlooked in the existing literature, whereas human annotation in practice uses majority voting to resolve disagreements among annotators. Therefore, this study introduces a straightforward ensemble strategy for sentiment analysis using LLMs. We demonstrate that an ensemble of multiple inferences using medium-sized LLMs produces more robust and accurate results than a single attempt with a large model, reducing RMSE by 18.6%.

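Summary

With the development of large language models (LLMs), LLMs have been applied to a wide range of tasks. Existing work, however, has largely overlooked the variability and reproducibility of results across individual LLM trials, whereas human annotation in practice usually relies on majority voting to resolve disagreements among annotators. This study therefore proposes a simple ensemble strategy for LLM-based sentiment analysis. The results show that an ensemble of multiple inferences from medium-sized LLMs is more robust and accurate than a single run of a large model, reducing root mean squared error (RMSE) by 18.6%.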
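The ensemble idea can be illustrated with a short sketch. It assumes that each inference run returns one numeric sentiment score per item and that runs are combined by a per-item mean, with a majority vote shown for categorical labels. The abstract does not state the aggregation rule, so both helpers and the toy numbers are assumptions, not the paper's procedure or data.

```python
import math
from collections import Counter
from typing import List, Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def ensemble_mean(runs: Sequence[Sequence[float]]) -> List[float]:
    """Average the scores of several inference runs item by item."""
    return [sum(scores) / len(scores) for scores in zip(*runs)]


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root mean squared error between predictions and reference scores."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: three runs over two items, with made-up reference scores.
targets = [3.0, 1.0]
runs = [[2.0, 1.0], [4.0, 1.0], [3.0, 2.0]]

single_run_error = rmse(runs[0], targets)
ensemble_error = rmse(ensemble_mean(runs), targets)
```

In this toy case the averaged prediction has a lower RMSE than the first run alone because the runs err in different directions; the size of any real improvement depends on how correlated the individual runs are.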

SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation

Abstract

arXiv:2412.11026v2 Announce Type: replace-cross Abstract: Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets <Subject-Predicate-Object> for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose SceneLLM, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal Transport (OT), we generate an implicit language signal from the frame-level token sequence that captures the video's spatio-temporal information. To further improve the LLM's ability to process this implicit linguistic input, we apply Low-Rank Adaptation (LoRA) to fine-tune the model. Finally, we use a transformer-based SGG predictor to decode the LLM's reasoning and predict semantic triplets. Our method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.

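Summary

Dynamic scenes contain complex spatio-temporal information that is crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Because spatio-temporal complexity fluctuates, parsing these scenes into <Subject-Predicate-Object> semantic triplets for accurate Scene Graph Generation (SGG) is highly challenging. Inspired by the reasoning capabilities of large language models (LLMs), we propose SceneLLM, a framework that uses LLMs as powerful scene analyzers for dynamic SGG. The framework introduces a Video-to-Language (V2L) mapping module that converts video frames into linguistic signals (scene tokens), making the input easier for LLMs to understand. To encode spatial information better, we design a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, that encodes spatial data into tokens. Using Optimal Transport (OT), we generate from the frame-level token sequence an implicit language signal that captures the video's spatio-temporal information. To further improve the LLM's ability to process this implicit linguistic input, we fine-tune the model with Low-Rank Adaptation (LoRA). Finally, a transformer-based SGG predictor decodes the LLM's reasoning and predicts semantic triplets. The method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments demonstrate SceneLLM's effectiveness in understanding and generating accurate dynamic scene graphs.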

Test It Before You Trust It: Applying Software Testing for Trustworthy In-context Learning

Abstract

arXiv:2504.18827v2 Announce Type: replace-cross Abstract: In-context learning (ICL) has emerged as a powerful capability of large language models (LLMs), enabling them to perform new tasks based on a few provided examples without explicit fine-tuning. Despite their impressive adaptability, these models remain vulnerable to subtle adversarial perturbations and exhibit unpredictable behavior when faced with linguistic variations. Inspired by software testing principles, we introduce a framework, called MMT4NL, for evaluating the trustworthiness of in-context learning by utilizing adversarial perturbations and software testing techniques. It covers diverse aspects of linguistic capability for testing the ICL capabilities of LLMs. MMT4NL is built around the idea of crafting metamorphic adversarial examples from a test set in order to quantify and pinpoint bugs in the designed prompts of ICL. Our philosophy is to treat any LLM as software and validate its functionalities just as one would test software. Finally, we demonstrate applications of MMT4NL on sentiment analysis and question-answering tasks. Our experiments reveal various linguistic bugs in state-of-the-art LLMs.

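Summary

In-context learning (ICL) has become a powerful capability of large language models (LLMs), allowing them to perform new tasks from a few provided examples without explicit fine-tuning. Despite this adaptability, the models remain vulnerable to subtle adversarial perturbations and behave unpredictably when faced with linguistic variation. Inspired by software testing principles, we propose a testing framework called MMT4NL that evaluates the trustworthiness of in-context learning using adversarial perturbations and software testing techniques. The framework covers diverse aspects of linguistic capability for testing the ICL abilities of LLMs. Its core idea is to construct metamorphic adversarial examples from a test set in order to quantify and locate defects in the prompts designed for ICL. Our philosophy is to treat any LLM as software and validate its functionality the way software is tested. Finally, we demonstrate MMT4NL on sentiment analysis and question-answering tasks. The experiments show that the approach can reveal various linguistic bugs in current state-of-the-art LLMs.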
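A metamorphic test in this spirit checks that a meaning-preserving perturbation leaves the prediction unchanged, and reports each violation as a bug. The sketch below is only an illustration: the two perturbations, the deliberately brittle `toy_classifier` standing in for an LLM with an ICL prompt, and all function names are hypothetical and are not part of MMT4NL.

```python
from typing import Callable, List, Sequence, Tuple


def uppercase(text: str) -> str:
    """Perturbation: change letter case, which should not change sentiment."""
    return text.upper()


def pad_whitespace(text: str) -> str:
    """Perturbation: add surrounding spaces, which should not change sentiment."""
    return "  " + text + "  "


def toy_classifier(text: str) -> str:
    """Hypothetical, deliberately case-sensitive stand-in for the model under test."""
    return "positive" if "good" in text else "negative"


def metamorphic_failures(classify: Callable[[str], str],
                         inputs: Sequence[str],
                         perturbations: Sequence[Callable[[str], str]]) -> List[Tuple[str, str]]:
    """Return (perturbation name, input) pairs where the prediction changed."""
    failures = []
    for text in inputs:
        expected = classify(text)  # the source prediction is the reference
        for perturb in perturbations:
            if classify(perturb(text)) != expected:
                failures.append((perturb.__name__, text))
    return failures


failures = metamorphic_failures(
    toy_classifier,
    ["a good movie", "a dull movie"],
    [uppercase, pad_whitespace],
)
```

No ground-truth labels are needed here: the metamorphic relation compares the model with itself, which is what makes the technique usable on unlabeled test inputs.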

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Abstract

arXiv:2503.18892v2 Announce Type: replace-cross Abstract: DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning (RL) framework with rule-based rewards, where training may start directly from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have primarily focused on the Qwen2.5 model series, which may not be representative, as we find the base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models, spanning different families and sizes including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting the format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, the increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.

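Summary

DeepSeek-R1 showed that long chain-of-thought (CoT) reasoning can emerge naturally through a simple reinforcement learning (RL) framework with rule-based rewards, with training starting directly from base models, a paradigm known as zero RL training. Most recent attempts to reproduce zero RL training have focused on the Qwen2.5 model series, but we find that these base models already have strong instruction-following and self-reflection abilities and so may not be representative. This work studies zero RL training on 10 different base models across families and sizes, including LLama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-math-7B, and all Qwen2.5 models from 0.5B to 32B. With key design strategies such as adjusting the format reward and controlling query difficulty, we obtain substantial gains in reasoning accuracy and response length in most settings. By closely monitoring training dynamics, however, we find that different base models show different patterns during training. For example, longer responses do not always coincide with the emergence of certain cognitive behaviors such as verification (the "aha moment"). Notably, we observe the "aha moment" for the first time in small models outside the Qwen family. We share the key designs that enable successful zero RL training, together with our findings and practical experience. To support further research, we open-source the code, models, and analysis tools.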

RoBridge: A Hierarchical Architecture Bridging Cognition and Execution for General Robotic Manipulation

Abstract

arXiv:2505.01709v2 Announce Type: replace-cross Abstract: Operating robots in open-ended scenarios with diverse tasks is a crucial research and application direction in robotics. While recent progress in natural language processing and large multimodal models has enhanced robots' ability to understand complex instructions, robot manipulation still faces the procedural skill dilemma and the declarative skill dilemma in open environments. Existing methods often compromise cognitive and executive capabilities. To address these challenges, in this paper, we propose RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) serving as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge maintains the declarative skill of VLM and unleashes the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. RoBridge demonstrates significant performance improvements over existing baselines, achieving a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. This work represents a significant step towards integrating cognitive reasoning with physical execution in robotic systems, offering a new paradigm for general robotic manipulation.

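Summary

Operating robots on diverse tasks in open-ended scenarios is an important research and application direction in robotics. Although recent progress in natural language processing and large multimodal models has improved robots' ability to understand complex instructions, robot manipulation in open environments still faces a procedural skill dilemma and a declarative skill dilemma. Existing methods often fail to balance cognitive and executive capabilities. To address this, the paper proposes RoBridge, a hierarchical intelligent architecture for general robotic manipulation. It consists of a high-level cognitive planner (HCP) based on a large-scale pre-trained vision-language model (VLM), an invariant operable representation (IOR) that serves as a symbolic bridge, and a generalist embodied agent (GEA). RoBridge preserves the declarative skill of the VLM while unleashing the procedural skill of reinforcement learning, effectively bridging the gap between cognition and execution. Experiments show significant gains over existing baselines: a 75% success rate on new tasks and an 83% average success rate in sim-to-real generalization using only five real-world data samples per task. The work marks an important step toward integrating cognitive reasoning with physical execution in robotic systems and offers a new paradigm for general robotic manipulation.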

Liger: Linearizing Large Language Models to Gated Recurrent Structures

Abstract

arXiv:2503.01496v2 Announce Type: replace-cross Abstract: Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93% of the performance of the Transformer-based LLM using 0.02% of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.

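Summary

Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Although such non-standard architectures have shown efficiency and performance advantages, pretraining them from scratch remains costly and risky. Linearization of large language models (LLMs) converts pretrained standard models into linear recurrent structures, making deployment more efficient. Existing linearization methods, however, usually introduce additional feature map modules that need extensive fine-tuning, and they overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, the paper proposes Liger (short for Linearizing LLMs to gated recurrent structures), a new method that converts pretrained LLMs into gated linear recurrent models without adding parameters. It reuses the pretrained key matrix weights to build diverse gating mechanisms, which supports a variety of gated recurrent structures while avoiding training additional components from scratch. With lightweight Low-Rank Adaptation (LoRA) fine-tuning, Liger restores the linearized gated recurrent models to the performance of the original LLMs. We also propose Liger Attention, an intra-layer hybrid attention mechanism that recovers 93% of the performance of the Transformer-based LLM using only 0.02% of the pre-training tokens during linearization, with competitive results on multiple benchmarks validated on models from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.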
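For orientation, the generic gated linear recurrence that such models build on can be written as a state update S_t = g_t * S_{t-1} + k_t^T v_t with output o_t = q_t S_t. The pure-Python sketch below implements that generic form with a scalar gate per step. It is not Liger's implementation: how Liger derives the gates from the pretrained key projection is not described in the abstract, so the gates are simply passed in as inputs here, and the function name is hypothetical.

```python
from typing import List, Sequence


def gated_linear_attention(qs: Sequence[Sequence[float]],
                           ks: Sequence[Sequence[float]],
                           vs: Sequence[Sequence[float]],
                           gates: Sequence[float]) -> List[List[float]]:
    """Run S_t = g_t * S_{t-1} + k_t^T v_t and emit o_t = q_t S_t for each step."""
    d_k, d_v = len(ks[0]), len(vs[0])
    # The state is a fixed d_k x d_v matrix, so memory does not grow with sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(qs, ks, vs, gates):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old state by the gate, then add the outer product k^T v.
                state[i][j] = g * state[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


# Toy usage: two steps, 2-dimensional keys/queries, 1-dimensional values.
qs = [[1.0, 0.0], [1.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[2.0], [3.0]]
no_decay = gated_linear_attention(qs, ks, vs, gates=[1.0, 1.0])
with_decay = gated_linear_attention(qs, ks, vs, gates=[1.0, 0.5])
```

With all gates equal to 1 the recurrence reduces to unnormalized causal linear attention; a gate below 1 down-weights older key-value pairs, which is the role gating plays in this family of models.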

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Abstract

arXiv:2505.03335v2 Announce Type: replace-cross Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

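Summary

Reinforcement learning with verifiable rewards (RLVR) has shown promise for improving the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR work in the zero setting avoids supervision on the reasoning process but still depends on manually curated collections of questions and answers for training. The scarcity of high-quality human-produced examples raises concerns about long-term scalability, a problem already apparent in language model pretraining. Moreover, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer very limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model proposes tasks that maximize its own learning progress and improves its reasoning by solving them, with no reliance on external data. Under this paradigm we develop the Absolute Zero Reasoner (AZR), which uses a code executor both to validate the code reasoning tasks it proposes and to verify their answers, providing a unified source of verifiable reward that guides open-ended yet grounded learning. Although trained entirely without external data, AZR achieves overall state-of-the-art performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. We further show that AZR applies across model scales and is compatible with various model classes.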
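The role of the code executor as a single source of verifiable reward can be sketched as follows. This is a minimal illustration under stated assumptions: a proposed task is taken to be a Python source string defining a function `f` plus an input, the reward is binary, and an invalid task earns zero. These conventions and the helper names are hypothetical; the abstract does not specify AZR's task format or reward shaping.

```python
from typing import Any


def run_program(source: str, argument: Any) -> Any:
    """Execute the proposed program and return f(argument)."""
    namespace: dict = {}
    exec(source, namespace)  # illustration only; real systems need a sandbox
    return namespace["f"](argument)


def verifiable_reward(source: str, argument: Any, predicted_output: Any) -> float:
    """Reward 1.0 if the solver's predicted output matches actual execution, else 0.0."""
    try:
        gold = run_program(source, argument)
    except Exception:
        return 0.0  # the proposed task itself is invalid, so nothing can be verified
    return 1.0 if predicted_output == gold else 0.0


# Toy usage: a proposed task and two candidate answers from a solver.
task_source = "def f(x):\n    return x * 2"
correct = verifiable_reward(task_source, 3, 6)
wrong = verifiable_reward(task_source, 3, 7)
invalid = verifiable_reward("def f(x) return x", 3, 6)
```

The same executor call serves two purposes, checking that a proposed task actually runs and checking a solver's answer against ground truth, which is why no human-labeled answers are required.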

InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

Abstract

arXiv:2502.03885v3 Announce Type: replace-cross Abstract: Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle-ground approach by leveraging Optical Circuit Switches, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs). We propose InfiniteHBD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfiniteHBD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt into variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node; and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh) based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm maximizing GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfiniteHBD achieves 31% of the cost of NVL-72, a near-zero GPU waste ratio (over one order of magnitude lower than NVL-72 and TPUv4), and near-zero cross-ToR traffic when the node fault ratio is under 7%, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs per node).

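Summary

Scaling large language model (LLM) training relies on multi-dimensional parallelism, in which High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism such as Tensor Parallelism (TP) and Expert Parallelism (EP). Existing HBD architectures, however, have fundamental limitations in scalability, cost, and fault resilience: switch-centric HBDs (e.g., NVL-72) incur very high scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle path by connecting through Optical Circuit Switches (OCS), but their fault radius remains at the cube level (e.g., 64 TPUs).

We propose InfiniteHBD, a transceiver-centric HBD architecture that uses optical circuit switching to unify connectivity and dynamic switching at the transceiver level. By embedding OCS in each transceiver, the architecture provides reconfigurable point-to-multipoint connectivity, letting the topology adapt into variable-size rings. The design offers: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node; and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a low-cost Silicon Photonic (SiPh) based OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm that maximizes GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation shows that InfiniteHBD costs 31% as much as NVL-72, has a near-zero GPU waste ratio (over one order of magnitude lower than NVL-72 and TPUv4), keeps cross-ToR traffic near zero when the node fault ratio is under 7%, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs per node).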
